# Personal project - Data Visualization with `plotly.express`

For this visualization project, I used the [Kickstarter Projects](https://www.mavenanalytics.io/data-playground?search=Kickstarter%20Projects) dataset from Maven Analytics. The data lists 375,000+ Kickstarter projects from 2009-2017.

## Prep

In [None]:
!pip install jupyter-dash



In [None]:
#Import libraries
import pandas as pd
import plotly
import plotly.express as px


In [None]:
#Import dash library
from dash import dcc
from dash import html
from dash import Input, Output
from jupyter_dash import JupyterDash

The dataset has the following columns: `ID`, `Name`, `Category`, `Subcategory`, `Country`, `Launched`, `Deadline`, `Goal`, `Pledged`, `Backers`, `State`.

Since the original dataset has 350k+ rows, a random subsample of 15,000 rows is used for some of the charts (the histogram and the line chart) to keep computation light.

In [None]:
#Reading in the dataset
df = pd.read_csv("https://raw.githubusercontent.com/gabizsiros/DV3_Python/main/kickstarter_projects.csv")
df_sub = df.sample(n=15000, random_state=42) #random subsampling
df_sub.head()

## 1. Bar chart
The first visualization shows the 10 most successful campaigns. Naturally, their state is 'Successful' in all cases. The chart also shows the original goal, where it is large enough to be visible next to the pledged amount. To build it, a subset is created by sorting the original dataframe by the `Pledged` amount and taking the first 10 rows.  
It is important to note that this is not a simple stacked bar chart: stacking would sum the amounts for `Goal` and `Pledged`, but in this case one data series is a subset of the other, so the bars are overlaid instead.

Adding text to the bars would add it twice, since the plot has two x series. This can be avoided by updating only one trace, chosen with the `selector` argument.

In [None]:
subdf1 = df.sort_values("Pledged", ascending= False).head(10)

color_map = {"Pledged": "lightgreen", "Goal": "darkgreen"}

# Create bar plot with two layers
fig1= px.bar(subdf1, y="Name", x=["Pledged", "Goal"],
             title='Most successful <i>Kickstarter</i> campaigns<br><sub>Top 10 campaigns and their categories based on Pledged amount</sub>',
             labels = {"Name": "", "variable":""},
             barmode='overlay', opacity=1,
             color_discrete_map =color_map)

fig1.update_layout(title_x = 0.5, plot_bgcolor='white',xaxis= {'gridcolor':'lightgrey'})
fig1.update_yaxes(autorange="reversed")

fig1.update_traces(text=subdf1['Subcategory'], selector=dict(name='Pledged'))

## 2. Histogram  
To break down the raised amount per category, aggregation is required, for which a histogram is a better fit than a bar chart, since `px.histogram` sums `Pledged` per category on its own.

Coloring the columns by country shows the overwhelming presence of the United States and which other countries are prominent. This only works when the dataset contains a limited number of countries, as in this case.

In [None]:
# Create histogram of pledged amount per category, colored by country
fig2 = px.histogram(df_sub, x='Category', y='Pledged', 
                    color = 'Country' , 
                    #category_orders={'Category': category_totals['Category']},
                    title='<i>Kickstarter</i> campaigns by categories')

fig2.update_layout(title_x=0.5, plot_bgcolor='white',yaxis= {'gridcolor':'lightgrey'})
fig2.update_yaxes(title='Total pledged amount')
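
The commented-out `category_orders` argument above refers to a `category_totals` frame that is never defined in this notebook. One way it could be built is to sum `Pledged` per category and sort in descending order. This is a minimal sketch on a made-up four-row frame; the `sample` data and the `category_order` name are hypothetical.

```python
import pandas as pd

# Hypothetical miniature stand-in for df_sub, for illustration only
sample = pd.DataFrame({
    "Category": ["Games", "Music", "Games", "Art"],
    "Pledged": [500, 200, 300, 100],
})

# Total pledged per category, largest first
category_totals = (
    sample.groupby("Category", as_index=False)["Pledged"]
    .sum()
    .sort_values("Pledged", ascending=False)
)

# Plain list of category names in display order
category_order = category_totals["Category"].tolist()
print(category_order)
```

The resulting list could then be passed to `px.histogram` as `category_orders={'Category': category_order}`.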

## 3. Line
_Actionable: Category_  
A simple line chart shows the overall money raised for certain categories. To display a time series properly, it is important that the date columns are converted to datetime format. To display the overall growth, an additional column is engineered that tracks the cumulative amount day by day.

In [None]:
# Convert Date column to datetime format
df_sub['Launched'] = pd.to_datetime(df_sub['Launched'])

# Group data by date and sum the pledged amount
subdf3 = df_sub.groupby([ 'Category','Launched'])['Pledged'].sum().reset_index()
#subdf3 = df_sub.query('Category == "Journalism"')

# Sort data by date
subdf3 = subdf3.sort_values('Launched')

# Add a new column with the cumulated sum of the pledged amount
subdf3['Cumulative Pledged'] = subdf3['Pledged'].cumsum()

# Create line chart with date on x-axis and cumulative pledged amount on y-axis
fig3 = px.line(subdf3, x='Launched', y='Cumulative Pledged', title='Cumulative Pledged Amount over Time')

# Update x-axis format
fig3.update_layout(xaxis_tickformat='%b %d, %Y',  plot_bgcolor='white',title_x = 0.5,
                   xaxis= {'gridcolor':'lightgrey'},yaxis= {'gridcolor':'lightgrey'})
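
The cell above accumulates `Pledged` across all categories at once. If a separate running total per category were wanted, for example to draw one line per category, a grouped cumulative sum is one option. This is a minimal sketch on hypothetical sample data, not the notebook's own method.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    "Category": ["Games", "Music", "Games", "Music"],
    "Launched": pd.to_datetime(
        ["2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04"]
    ),
    "Pledged": [100, 50, 200, 25],
})

# Sort by date first so the running total follows time order
sample = sample.sort_values("Launched")

# Running total that restarts for every category
sample["Cumulative Pledged"] = sample.groupby("Category")["Pledged"].cumsum()
print(sample)
```

With such a column, `px.line` could be given `color='Category'` to draw one line per category.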


## 4. Faceted scatter
_Actionable: Pledged max_  

For the scatterplot, the Goal of a project is compared with the actual pledged amount, category by category. The data contains some very extreme outliers, as already shown in the first graph, so an important step is to limit the graph range and exclude the outliers as much as possible. Otherwise most of the data would be crammed into the lower left corner, making it very hard to read. For demonstration, the limit is set to 1M.

In this case, the objective of the visualization is less to show correlation between the two variables and more to draw some visual conclusions. By separating the colors according to the state of the project (failed or successful), the approximate ratio between successful and failed projects can be shown in each Category. To aid the separation, a 45° line can be added, which in some cases (like Film & Video) is already defined by the datapoints.
Another difference is simply the number of scatter points, which makes it easy to compare volume between categories.

The background color is kept grey to make the gridlines visible. (Setting `gridcolor` through `update_layout(xaxis=..., yaxis=...)` only affects the first pair of axes, which belongs to the bottom-left plot.) The width needs to be hardcoded, otherwise the whole plot would stretch and override the 1:1 axis ratio.

In [None]:
subdf4 = df.query('Pledged < '+str(1000000))
fig4 = px.scatter(subdf4, x='Goal', y='Pledged', title='Goal vs. Pledged per Category', 
                      color = "State", facet_col='Category', facet_col_wrap=3, 
                  height = 1200, width = 700, range_x = [0,1000000], range_y = [0,1000000] )

fig4.update_layout(title_x = 0.5,
                   #plot_bgcolor='white', 
                   xaxis= {"scaleanchor":"y","scaleratio":1, 'gridcolor': 'lightgrey'},
                   yaxis= {"scaleanchor":"y","scaleratio":1, 'gridcolor': 'lightgrey'}
                  )
#Remove the "Category=" annotation with a loop 
fig4.for_each_annotation(lambda a: a.update(text=a.text.replace("Category=", "")))

# Add a 45-degree line looping through the 3x5 grid
for i in range(1, 16):
    fig4.add_shape(type="line",
                  x0=df["Goal"].min(),
                  y0=df["Goal"].min(),
                  x1=df["Goal"].max(),
                  y1=df["Goal"].max(),
                  line=dict(color="black", width=1, dash="dash"),
                  row=int((i-1)/3)+1, col=i%3 if i%3 != 0 else 3)
fig4.show()
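
The `row`/`col` arithmetic in the loop above maps a running panel index from 1 to 15 onto a grid with 3 columns. The same mapping can be written with `divmod`, which may be easier to read. The helper name `grid_position` is made up for this sketch.

```python
def grid_position(i, n_cols=3):
    """Map a 1-based panel index to a 1-based (row, col) pair."""
    row, col = divmod(i - 1, n_cols)
    return row + 1, col + 1


# All 15 panel positions of the 3x5 grid, row by row
positions = [grid_position(i) for i in range(1, 16)]
print(positions)
```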

## 5. Treemap
_Actionable: Country, aggregation method_  
A treemap is an exciting way to visualize the differences and ratios between features, in this case the categories. A country dimension can be added by query, to see the distribution within each country. One way to aggregate is to sum the amount of funds collected, another is to count the projects. The example below uses the United Kingdom.
To enable interactivity in Dash at a later stage, two subsets are created, one per aggregation method.

The coloring corresponds completely to the square sizes, so it simply helps to interpret small differences between values through shading. Since the treemap per country takes the categories as its variable, it only works if the queried country does not have 0 as the pledged total in any category. To account for this scenario, error handling is built into the code at this stage.

In [None]:
subdf5 = df.query('Country == "United Kingdom"')

#aggregating by sum (numeric columns only, so text columns are not summed)
subdf5A = subdf5.groupby('Category').sum(numeric_only=True).reset_index()
#aggregating by count
subdf5B = subdf5.groupby('Category').count().reset_index()

#two titles as a variable, which can be changed with interactivity
titles = ['Pledged amount by Category', 'Number of projects by Category']

# Create the treemap for sum aggregation with error handling
try:
    fig5 = px.treemap(subdf5A, path=['Category', 'Pledged'], values='Pledged', 
                     color='Pledged', 
                     title= titles[0])
        
    fig5.update_layout(plot_bgcolor='white', title_x = 0.5)
    fig5.update_traces(texttemplate= '%{value:,}')
    fig5.show()
except ZeroDivisionError:
    print("Error: There is a category with 0 value in the selected country dataset.")
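
Instead of catching the exception after the fact, the aggregated frame could be filtered before plotting. This is a minimal sketch, assuming (as described above) that a category with a pledged total of 0 is what triggers the error. The helper name `drop_zero_categories` and the sample frame are hypothetical.

```python
import pandas as pd


def drop_zero_categories(agg, value_col="Pledged"):
    """Keep only rows whose aggregated value is strictly positive."""
    return agg[agg[value_col] > 0].reset_index(drop=True)


# Hypothetical aggregated frame with one empty category
agg = pd.DataFrame({
    "Category": ["Art", "Games", "Music"],
    "Pledged": [0, 1500, 300],
})

plottable = drop_zero_categories(agg)
print(plottable)
```

The filtered frame could then be passed to `px.treemap` in place of `subdf5A`. A country whose filtered frame is empty would still need separate handling.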








## Dash
The dashboard collects all 5 figures and enables interactivity for some of them. Given its length, it is better to open the dashboard in a separate tab.

Notes on interactivity:
Of the 5 plots, the last 3 are interactive.
Within the `dash` chunk, the function that updates the charts takes 4 inputs:
- `input3` provides the Category for the line chart,
- `input4` provides the maximum pledged amount for the scatterplot,
- `input5A` provides the Country for the treemap,
- `input5B` selects between the two aggregation methods for the treemap: sum and count.
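
Values arriving from a free-text input field may be strings, empty, or non-numeric, so the callback may want to normalise `input4` before using it in a query or as an axis range. This is a minimal sketch. The helper name `parse_max_pledge` and the fallback of 1,000,000 are assumptions for illustration, not part of the notebook.

```python
def parse_max_pledge(value, default=1_000_000):
    """Return value as a positive int, or default if it cannot be parsed."""
    try:
        parsed = int(float(value))
    except (TypeError, ValueError, OverflowError):
        # None, empty strings, text, NaN and infinity all end up here
        return default
    return parsed if parsed > 0 else default


print(parse_max_pledge("500000"))
print(parse_max_pledge(""))
```

Inside the callback, the parsed number could replace `_input4` both in the `query` string and in `range_x` / `range_y`.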

In [None]:
categories = df['Category'].unique().tolist()
countries = df['Country'].unique().tolist()
countries.remove('Japan') #Japan needs to be removed to avoid error due to 0 Pledge

In [None]:
style1={'textAlign': 'center', 'color': 'Navy', "fontFamily": "Arial"}
style2={'textAlign': 'left', 'color': 'Navy', "fontFamily": "Arial"}

app = JupyterDash(__name__)
app.layout = html.Div(
[
    html.H1("DataViz on Kickstarter Project dataset", style=style1),
    html.Div("Personal project on Dash", style=style1),
    html.P("Data visualization project based on the Kickstarter Projects dataset from Maven Analytics.",style=style2),
    html.H2("1. Bar chart", style=style2),
    html.P("The 10 highest-earning projects on Kickstarter (as of 2017), visualized by a horizontal bar chart.",style=style2),
    dcc.Graph( id='graph1', figure=fig1),
    html.H2("2. Histogram", style=style2),
    html.P("""Distribution of money raised in each category is displayed in a Stacked Bar chart. 
    While there are a bit more than 20 countries in the data, the largest countries based on pledges 
    (such as US, but even UK and Canada) are easily visible.""",style=style2),
    dcc.Graph( id='graph2', figure=fig2),
    html.H2("3. Line graph", style=style2),
    html.P("The overall raised money throughout the year is shown in a line graph in each Category.",style=style2),
    html.P("Choose a Category from the Dropdown!",style=style2),
    dcc.Dropdown( id= 'input3', options=categories, value = categories[0], style={"width": "30%", 'color': 'Navy', "fontFamily": "Arial"}),
    dcc.Graph( id='graph3', figure=fig3),
    html.H2("4. Scatterplot", style=style2),
    html.P("""The ratio of successful and failed campaigns is broken down in each category. The differences in the number of projects in each category, as well as the magnitude of succeeded campaigns, are clearly visible even without exact numbers. 
    For better visualization, a number should be selected between 100 thousand and 1 million, otherwise the outliers and the crammed-in values will distort the plot.       
    """,style=style2),
    html.Div("Select the max Pledge value in full number format:", style=style2),
    dcc.Input( id= 'input4', value = 1000000, style={"width": "30%", 'color': 'Navy', "fontFamily": "Arial"}),
    dcc.Graph( id='graph4', figure=fig4),
    html.H2("5. Treemap", style=style2),
    html.P("The treemap demonstrates the popularity of Kickstarter categories in each country with two kinds of aggregation methods.",style=style2),
    html.Div([
        html.P("Select a country:", style={"display": "inline-block",'color': 'Navy', "fontFamily": "Arial"}),
        dcc.RadioItems( id= 'input5A', options=countries, value = countries [0], style={"width": "100%",'color': 'Navy', "fontFamily": "Arial"}, inline = True),
        html.P("Select an aggregation method:", style={"display": "inline-block",'color': 'Navy', "fontFamily": "Arial"}),
        dcc.RadioItems( id= 'input5B', options= ["Sum","Count"], value = "Sum", style = style2,  inline = True),
        dcc.Graph( id='graph5', figure=fig5)
    ])
                 ]
)


@app.callback(
    Output('graph3', 'figure'),
    Output('graph4', 'figure'),
    Output('graph5', 'figure'),
    Input('input3', 'value'),
    Input('input4', 'value'),
    Input('input5A', 'value'),
    Input('input5B', 'value')
)



def update(_input3,_input4,_input5A,_input5B):
   #updating 3rd graph

    subdf3 = df_sub.query('Category == @_input3').copy()  # copy so a column can be added safely
    subdf3 = subdf3.sort_values('Launched')
    subdf3['Cumulative Pledged'] = subdf3['Pledged'].cumsum()
    fig3 = px.line(subdf3, x='Launched', y='Cumulative Pledged', title='Cumulative Pledged Amount over Time')
    fig3.update_layout(xaxis_tickformat='%b %d, %Y',  plot_bgcolor='white',title_x = 0.5,
                   xaxis= {'gridcolor':'lightgrey'},yaxis= {'gridcolor':'lightgrey'})    

    
     #_input4 = _input4.astype('int64')
    
     #updating 4th graph
    subdf4 = df.query('Pledged < '+str(_input4))
    fig4 = px.scatter(subdf4, x='Goal', y='Pledged', title='Goal vs. Pledged per Category', 
                      color = "State", facet_col='Category', facet_col_wrap=3, 
                  height = 1200, width = 700, range_x = [0,_input4], range_y = [0,_input4] )

    fig4.update_layout(title_x = 0.5,
                   #plot_bgcolor='white', 
                   xaxis= {"scaleanchor":"y","scaleratio":1, 'gridcolor': 'lightgrey'},
                   yaxis= {"scaleanchor":"y","scaleratio":1, 'gridcolor': 'lightgrey'}
                  )
    
    # Add a 45-degree line looping through the 3x5 grid
    for i in range(1, 16):
        fig4.add_shape(type="line",
                  x0=df["Goal"].min(),
                  y0=df["Goal"].min(),
                  x1=df["Goal"].max(),
                  y1=df["Goal"].max(),
                  line=dict(color="black", width=1, dash="dash"),
                  row=int((i-1)/3)+1, col=i%3 if i%3 != 0 else 3)
    
    
    #updating 5th graph
    
    subdf5 = df.query('Country == @_input5A')
    
    subdf5A = subdf5.groupby('Category').sum(numeric_only=True).reset_index()
    subdf5B = subdf5.groupby('Category').count().reset_index()
        
    titles = ['Pledged amount by Category', 'Number of projects by Category']
    
    fig5A = px.treemap(subdf5A, path=['Category', 'Pledged'], values='Pledged', 
                 color='Pledged', 
                 title= titles[0])
    
    fig5A.update_layout(plot_bgcolor='white', title_x = 0.5)
        
    fig5B = px.treemap(subdf5B, path=['Category', 'Pledged'], values='Pledged', 
                 color='Pledged', 
                 title= titles[1])
    
        
        #set default for fig5
    fig5 = fig5A
    if _input5B == "Sum":
           fig5 = fig5A
    elif _input5B == "Count":
           fig5 = fig5B
    
    fig5.update_layout(plot_bgcolor='white', title_x = 0.5)
    fig5.update_traces(texttemplate= '%{value:,}')       
    return fig3, fig4, fig5
        


 #app.run_server(mode='inline', port =1015)
app.run_server()

Dash is running on http://127.0.0.1:8050/


