# Data Cleaning and Exploratory Data Analysis

This notebook covers the data cleaning and an exploratory data analysis of the Kickstarter campaign data.


In [None]:
# libraries needed
import pandas as pd 
import pickle, os, json, re
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

##### Data cleaning

The relevant columns were chosen based on two criteria: the information could influence whether a campaign reaches its goal, and it is available to clients before a potential campaign launch. The relevant columns are:\
**category** (of the project), **country**, **goal** (which requires the **static_usd_rate** column, as not all goals are given in USD), and the **launched_at** and **deadline** columns, used to derive a campaign's duration and to check which month or even day of the week is most likely to lead to a successful campaign. \
Further, the **state** column, which holds a campaign's state (e.g. successful or failed), was included as the target column, and the **creator** column was included to help identify possible duplicates.
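
As a side note on column selection, `pd.read_csv` accepts a `usecols` argument, so the restriction to relevant columns could also happen while reading. The following is only a minimal sketch: the two-row CSV text and its `id` and `blurb` columns are made up for illustration and do not come from the dataset.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for one exported file (made-up values).
csv_text = (
    "id,country,goal,static_usd_rate,state,blurb\n"
    "1,US,5000,1.0,successful,some text\n"
    "2,GB,2000,1.3,failed,more text\n"
)

wanted = ["country", "goal", "static_usd_rate", "state"]

# usecols tells pandas to parse only the listed columns.
subset = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(subset.columns.tolist())
```

Reading fewer columns mainly saves memory and parsing time when the files are large.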

In [None]:
directory = 'Kickstarter_data/'
relevant_columns = ['category', 'country', 'creator', 'state', 'static_usd_rate', 'goal', 'launched_at', 'deadline']

# read every file in the directory, keep only the relevant columns and concatenate once at the end
frames = []
for file in sorted(os.listdir(directory)):
    df_temp = pd.read_csv(os.path.join(directory, file))
    frames.append(df_temp[relevant_columns])
data = pd.concat(frames, ignore_index=True)

data.head()


In [None]:
data.info()

In [None]:
############ clean data of duplicates ################################################################################

data = data.drop_duplicates(ignore_index =True)   
data.info()

In total, **2 duplicates** were eliminated. \
Furthermore, the dataframe info shows that none of the columns contains null values in any of the 209220 rows.
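
The two checks described here (duplicates and null values) can also be made explicit instead of reading them off `data.info()`. This is a minimal sketch on a made-up toy frame, not on the actual data:

```python
import pandas as pd

# Toy frame with one exact duplicate row (made-up values).
toy = pd.DataFrame({
    "creator": ["a", "b", "b", "c"],
    "goal": [100.0, 250.0, 250.0, 75.0],
    "state": ["successful", "failed", "failed", "live"],
})

n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row
n_missing = toy.isna().sum()                 # null count per column
deduplicated = toy.drop_duplicates(ignore_index=True)

print(n_duplicates)
print(n_missing)
print(len(deduplicated))
```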

In [None]:
### Looking at the target column
data.state.value_counts()

Investigating the target column reveals 5 types of entries: _successful_, which will be our 1 for the classification; _live_, marking rows that need to be excluded because their outcome is still uncertain; and _failed_, _canceled_ and _suspended_. As we aim at predicting successes, _failed_, _canceled_ and _suspended_ are combined into our 0 for the classification.
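
One possible way to express this encoding is an explicit mapping from state to label. Unlike a catch-all rule that sends everything except _successful_ to 0, a mapping turns unexpected labels into missing values, which makes them easy to spot. The series below is made-up toy data and the name `state_to_label` is only illustrative:

```python
import pandas as pd

# Toy series containing the five states named above (made-up order and counts).
states = pd.Series(["successful", "failed", "live", "canceled", "suspended", "successful"])

# Drop campaigns whose outcome is still open.
finished = states[states != "live"].reset_index(drop=True)

# Explicit mapping: any state missing from the dict would become NaN.
state_to_label = {"successful": 1, "failed": 0, "canceled": 0, "suspended": 0}
labels = finished.map(state_to_label)

# Fail loudly if an unexpected state slipped through.
assert labels.notna().all()
labels = labels.astype(int)
print(labels.tolist())
```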

In [None]:
data = data[data['state'] != 'live']                                            # exclude campaigns with state live
data = data.reset_index(drop=True)
data['state'] = data['state'].apply(lambda x: 1 if x == 'successful' else 0)    # assign 1 to successful campaigns and 0 to the rest
p_success = data['state'].mean() * 100                                          # share of successful campaigns in percent
p_fail = 100 - p_success
print(f'Out of the campaigns in the dataset {p_success:.2f} % were successful, and {p_fail:.2f} % were not successful.')

Out of the campaigns in the dataset, 58.18 % were successful and 41.82 % were not successful. This means our dataset is slightly imbalanced.

Investigating the country column

In [None]:
sns.histplot(data.country, stat='percent')
plt.title('Percentage of total campaigns per country')
plt.savefig('images/EDA/Campaigns_per_Country.png', dpi=600)

As visible from the histogram, more than 70% of the campaigns were launched from the United States, followed by Great Britain with slightly more than 10% and Canada with roughly 5%. This observation is not unexpected, as Kickstarter is a US website.\
Due to the large share of campaigns from the United States, we decided to group the countries. As Kickstarter campaigns typically involve products being delivered to the backers, we grouped the countries into _North America_ (US and CA) versus _Non-North America_, or in other words: a short versus a long way to transport the product to the backers.

In [None]:
# Define a new column "north america", including a 1 in case of a campaign being launched from the united states or canada and a 0 otherwise
data["north_america"] = data["country"].apply(lambda x: 1 if x in ['US', 'CA'] else 0)
data[["country", "north_america"]]
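
As an alternative to the row-wise `apply`, the same flag can be computed with the vectorized `Series.isin`. A minimal sketch on made-up country codes:

```python
import pandas as pd

# Toy country codes (made-up sample).
countries = pd.Series(["US", "GB", "CA", "DE", "US"])

# isin returns a boolean mask; casting to int gives the 1/0 flag.
north_america = countries.isin(["US", "CA"]).astype(int)
print(north_america.tolist())
```

Both variants should yield the same column; the vectorized form is usually faster on large frames.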

Because the campaigns originate from different countries, the goals are given in different currencies. Consequently, the goals needed to be converted to a single currency, for which we chose USD because the static_usd_rate was provided in the original dataset.

In [None]:
data['goal'] = round(data['goal'] * data['static_usd_rate'],2)
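
To illustrate the conversion, here is a small worked example. The goals and rates are made up, and the sketch assumes that `static_usd_rate` gives the USD value of one unit of the campaign's currency:

```python
import pandas as pd

# Made-up goals in local currency with made-up conversion rates.
toy = pd.DataFrame({
    "goal": [1000.0, 5000.0, 200.0],
    "static_usd_rate": [1.0, 1.25, 0.5],
})

# Goal in USD = goal in local currency * USD per unit of local currency.
toy["goal_usd"] = (toy["goal"] * toy["static_usd_rate"]).round(2)
print(toy)
```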

In [None]:
data.goal.describe()

As visible from the description, the lowest goal is 0.01 USD and the highest goal is 152350100.00 USD. As the majority of projects lie between 1500.00 USD and 13000.00 USD, we were curious about the share of successes and failures among the outliers.

In [None]:
sns.kdeplot(data, x='goal', hue='state')
plt.title('Goals with respect to the state')
plt.savefig('images/EDA/Goals_regarding_state.png', dpi=600)

Looking at the distribution of goals, it becomes obvious that the extremely high goals are not only extremely rare but also unsuccessful. A closer look at these projects showed that those with extremely high goals appeared unrealistic and/or not serious. Therefore, we decided to draw a line at 1000000 USD and to advise clients who aim for a higher goal that a successful campaign is rather unlikely. We chose 1000000 USD as the maximum goal because some projects with goals slightly above 1 million USD were still successful.
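
A claim like this can be checked numerically by comparing the success rate above and below a candidate threshold. The sketch below uses made-up goals and outcomes, so the printed numbers say nothing about the real data:

```python
import pandas as pd

# Made-up goals (USD) and outcomes (1 = successful, 0 = not successful).
toy = pd.DataFrame({
    "goal": [500.0, 12_000.0, 999_999.0, 1_000_000.0, 50_000_000.0],
    "state": [1, 0, 1, 0, 0],
})

threshold = 1_000_000
above = toy[toy["goal"] >= threshold]
below = toy[toy["goal"] < threshold]

print("campaigns at or above threshold:", len(above))
print("success rate at or above:", above["state"].mean())
print("success rate below:", below["state"].mean())
```

Repeating this for a few thresholds would show how sensitive the cut-off is.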

In [None]:
data = data[data['goal'] < 1000000]
data = data.reset_index(drop=True)
sns.kdeplot(data, x='goal', hue='state')
plt.title('Goals below 1Mio with respect to the state')
plt.savefig('images/EDA/Goals_regarding_state_below_1M.png', dpi=600)

The category column contains its information as a JSON string.

In [None]:
data.info()


In [None]:
cat_data = pd.DataFrame(data["category"].apply(json.loads).tolist())

In [None]:
cat_data.info()

In [None]:
cat_data.tail()

In [None]:
cat_data.slug

Looking at the different entries, we had to decide between name and slug, where slug includes the name as well as the "parent" category. Consequently, we decided to add the slug as a new column to our dataframe.
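
To make the difference between name and slug concrete, the sketch below decodes a single category string with `json.loads`. The record is a made-up example that only mimics the structure described here; real entries may contain further keys.

```python
import json

# Made-up category string with the two fields discussed above.
raw_category = '{"id": 1, "name": "Gadgets", "slug": "technology/gadgets"}'

category = json.loads(raw_category)   # JSON text -> Python dict
print(category["name"])               # the specific category only
print(category["slug"])               # parent category and specific category
```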

In [None]:
data['slug'] = cat_data['slug']
data.slug.nunique()

In [None]:
data.info()

As slug contains 169 different entries, most of which are very specific, we decided to focus on the parent categories, e.g. technology, fashion, music.

In [None]:
data["slug"] = data["slug"].apply(lambda x: re.split(r'/', x)[0])
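
The same parent extraction can be written with the vectorized string accessor instead of `re.split` inside `apply`. A minimal sketch; the slug values are made-up examples:

```python
import pandas as pd

# Made-up slugs, including one without a sub-category.
slugs = pd.Series(["technology/gadgets", "music/rock", "dance", "film & video/shorts"])

# Split on "/" and keep the first part, i.e. the parent category.
parents = slugs.str.split("/").str[0]
print(parents.tolist())
```

A slug without a slash is returned unchanged, so top-level categories are handled as well.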

In [None]:
print('Number of project categories: ',data.slug.nunique())
sns.histplot(data, x='slug', hue='state', multiple='stack', palette='viridis')
plt.title('Projects per Category with respect to success (state = 1) and fail (state = 0)')
plt.xticks(rotation=90)
plt.xlabel('Project Category')
plt.savefig('images/EDA/Projects_per_Category.png', dpi=600)



As visible from the histogram, the majority of campaigns launched on Kickstarter belong to the categories _film/video_ and _music_, while the fewest campaigns were launched in the category _dance_, followed by _journalism_. Looking at the numbers of failed and successful campaigns, it becomes obvious that in the majority of categories more than 50% of the campaigns succeed, with _dance_ and _comics_ being the categories with the highest relative success. However, there are some exceptions, namely _technology_, _food_ and _journalism_.

As seen in the dataframe info, the launched_at and deadline columns contain floats instead of datetime objects.

In [None]:
data.info()

In [None]:
# transform columns containing dates to datetime 
data['launched_at'] = pd.to_datetime(data['launched_at'], unit='s')
data['deadline'] = pd.to_datetime(data['deadline'], unit='s')
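
With `unit='s'`, `pd.to_datetime` interprets the numbers as seconds since 1970-01-01 (Unix time). A short sketch with two illustrative timestamps that are not taken from the dataset:

```python
import pandas as pd

# Two Unix timestamps in seconds (illustrative values).
timestamps = pd.Series([0, 1_600_000_000])

converted = pd.to_datetime(timestamps, unit="s")
print(converted)
```

The first value maps to 1970-01-01 00:00:00 and the second to 2020-09-13 12:26:40.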

In [None]:
data.info()

A common saying about launching campaigns is to do it on Tuesdays for the highest chance of success. We aimed at finding out whether this saying holds true, and whether we can gain more insight into which days of the week and which months of the year are most advisable for a campaign launch. \
Therefore, we added columns with the weekday and month of the launch.

In [None]:
# extract the day and month components
data['launched_at_weekday'] = data['launched_at'].dt.weekday
data['launched_at_month'] = data['launched_at'].dt.month

In [None]:
data.head()

Note that weekday 0 is Monday and month 1 is January.
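
This numbering can be confirmed on a few known dates. The dates below are chosen purely for illustration:

```python
import pandas as pd

# 2024-01-01 was a Monday, 2024-01-07 a Sunday, 2024-12-31 a Tuesday.
dates = pd.Series(pd.to_datetime(["2024-01-01", "2024-01-07", "2024-12-31"]))

weekday = dates.dt.weekday    # Monday = 0 ... Sunday = 6
month = dates.dt.month        # January = 1 ... December = 12
names = dates.dt.day_name()   # readable weekday names

print(weekday.tolist(), month.tolist(), names.tolist())
```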

In [None]:
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

sns.histplot(data, x='launched_at_weekday', hue='state', multiple='stack', discrete= True, palette='viridis')
plt.title('Projects per Weekday with respect to success (state = 1) and fail (state = 0)')
plt.xticks(ticks = np.arange(0,7),labels = days)
plt.xlabel('Day of the Week')
plt.savefig('images/EDA/Projects_per_Weekday.png', dpi=600)

The histogram shows that most campaigns are in fact launched on Tuesdays, while the fewest campaigns are launched on weekends. To get a clearer view of which weekday is associated with the highest rate of successful campaigns, we need to look at the percentage of successful campaigns per weekday.
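
Besides plotting, the success percentage per weekday can be computed directly: since state is coded as 1/0, the group mean equals the share of successful campaigns. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up launches: weekday (0 = Monday) and outcome (1 = successful).
toy = pd.DataFrame({
    "launched_at_weekday": [0, 0, 1, 1, 1, 5, 5, 5, 5],
    "state":               [1, 0, 1, 1, 0, 0, 0, 1, 0],
})

# Mean of a 1/0 column per group = success rate; scale to percent.
rate = toy.groupby("launched_at_weekday")["state"].mean().mul(100).round(2)
print(rate)
```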

In [None]:
from matplotlib import ticker
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
ax = sns.histplot(data, y='launched_at_weekday', hue='state', multiple='fill', palette='viridis', discrete=True, bins = 7)
sns.move_legend(ax, loc="upper left", bbox_to_anchor=(1,1))
for p in ax.patches:
    h = p.get_width()
    if h > 0: # skip empty bars
        txt = f'{h * 100:.2f} %'
        txt_y = p.get_y() + p.get_height() / 2
        txt_x = p.get_x() + h / 2
        ax.text(txt_x, txt_y, txt, ha='center', va='center')
ax.xaxis.set_major_formatter(ticker.PercentFormatter(1))
ax.yaxis.set_major_locator(ticker.FixedLocator(np.arange(0,7)))
ax.yaxis.set_major_formatter(ticker.FixedFormatter(days))
ax.set_title('Percentage of successful (state = 1) and failed (state = 0) campaigns per Weekday')
ax.set_ylabel('Day of the Week')
ax.set_xlabel('Percentage')
ax.figure.savefig('images/EDA/Projects_per_Weekday_success_percantage.png',dpi=600)


It is visible that the chance of a successful campaign is in fact slightly higher on Tuesdays compared to other days of the week and lowest on Saturdays. \
Next, we looked at the influence of the month.

In [None]:
month = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

ax = sns.histplot(data, x='launched_at_month', hue='state', multiple='stack', discrete = True, palette='viridis')
sns.move_legend(ax, loc="upper left", bbox_to_anchor=(1,1))
ax.set_title('Projects per Month with respect to success (state = 1) and fail (state = 0)')
ax.xaxis.set_major_locator(ticker.FixedLocator(np.arange(1,13)))            # place one tick at each month number 1-12
ax.xaxis.set_major_formatter(ticker.FixedFormatter(month))
ax.set_xlabel('Month of the Year')

plt.savefig('images/EDA/Projects_per_month.png', dpi=600)

In [None]:
month = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

ax = sns.histplot(data, y='launched_at_month', hue='state', multiple='fill', palette='viridis', discrete=True, bins = 12)
sns.move_legend(ax, loc="upper left", bbox_to_anchor=(1,1))
for p in ax.patches:
    h = p.get_width()
    if h > 0: # skip empty bars
        txt = f'{h * 100:.2f} %'
        txt_y = p.get_y() + p.get_height() / 2
        txt_x = p.get_x() + h / 2
        ax.text(txt_x, txt_y, txt, ha='center', va='center')
ax.xaxis.set_major_formatter(ticker.PercentFormatter(1))
ax.yaxis.set_major_locator(ticker.FixedLocator(np.arange(1,13)))
ax.yaxis.set_major_formatter(ticker.FixedFormatter(month))
ax.set_title('Percentage of successful (state = 1) and failed (state = 0) campaigns per month')
ax.set_ylabel('Month of the Year')
ax.set_xlabel('Percentage')
ax.figure.savefig('images/EDA/Projects_per_month_success_percantage.png',dpi=600)


Regarding the month in which a campaign is launched, October and July are the months with the most campaign launches, while the fewest campaigns are launched in December. Regarding success, campaigns launched in October appear to be slightly more likely to succeed, while campaigns launched in July or December are less likely to succeed.
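
Both views discussed here, the number of launches per month and the share of successes per month, can be tabulated with `pd.crosstab`. The sketch uses made-up data, so the numbers do not reflect the real distribution:

```python
import pandas as pd

# Made-up launches: month number and outcome (1 = successful).
toy = pd.DataFrame({
    "launched_at_month": [7, 7, 7, 10, 10, 10, 10, 12],
    "state":             [0, 0, 1, 1, 1, 1, 0, 0],
})

# Absolute counts per month and state.
counts = pd.crosstab(toy["launched_at_month"], toy["state"])
# Row-normalised shares: each month sums to 1.
shares = pd.crosstab(toy["launched_at_month"], toy["state"], normalize="index")

print(counts)
print(shares)
```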

Another feature that we assumed to contribute significantly to a campaign's success, and that is known before launching a campaign, is its duration. Therefore, we calculated the duration using the launched_at and deadline columns.

In [None]:
data['duration_days'] = (data["deadline"] - data["launched_at"]).dt.days
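
One detail worth keeping in mind: `.dt.days` returns only the whole-day component of a timedelta, so a campaign running 29 days and 23 hours counts as 29 days. The sketch below, with made-up timestamps, contrasts this with a fractional duration:

```python
import pandas as pd

# Made-up launch and deadline timestamps.
launched = pd.to_datetime(pd.Series(["2020-01-01 12:00:00", "2020-03-01 00:00:00"]))
deadline = pd.to_datetime(pd.Series(["2020-01-31 11:00:00", "2020-03-31 00:00:00"]))

delta = deadline - launched
whole_days = delta.dt.days                            # whole days only
fractional_days = delta.dt.total_seconds() / 86_400   # duration including hours

print(whole_days.tolist())
print(fractional_days.round(2).tolist())
```

Whether the truncation matters depends on how precisely the duration should enter a later model.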

In [None]:
data.info()

In [None]:
sns.kdeplot(data, x='duration_days', hue='state')
plt.savefig('images/EDA/Duration_regarding_state.png', dpi=600)

In [None]:
ax = sns.boxplot(data=data, y='duration_days', x='state', saturation=0.6)
ax.axhline(30, color="0.3", dashes=(2,2))
ax.figure.savefig('images/EDA/Duration_success_fail_boxplot.png',dpi=600)


From the kernel density estimate, it appears that the majority of campaigns run for roughly one month (approx. 30 days). Especially for the unsuccessful campaigns, a second peak can be seen at roughly two months (approx. 60 days).

Drop unnecessary columns: _category_, _country_, _creator_, _static_usd_rate_, _launched_at_, _deadline_.

In [None]:
# Execute only once! 
print(data.columns)
data = data.drop(['category', 'country', 'creator', 'static_usd_rate', 'launched_at', 'deadline'], axis =1)
print(data.columns)
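
The "execute only once" restriction exists because dropping already removed columns raises an error. One possible way around it is `errors='ignore'`, which makes the cell safe to re-run. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Made-up frame; 'creator' is deliberately absent.
toy = pd.DataFrame({"goal": [100.0], "country": ["US"], "state": [1]})
to_drop = ["country", "creator"]

# errors="ignore" skips labels that are not present instead of raising.
once = toy.drop(columns=to_drop, errors="ignore")
twice = once.drop(columns=to_drop, errors="ignore")   # second run is a no-op

print(once.columns.tolist())
print(twice.columns.tolist())
```

The trade-off is that a misspelled column name would also be ignored silently.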

Finally, a pairplot gives an overview of how the features are related.

In [None]:
sns.pairplot(data, hue='state')
plt.savefig('images/EDA/pairplot_relevant_features.png', dpi=600)