# Kickstarter Project

### Definition of relevant columns

* backers_count: number of people pledging money to the project
* category -> 'slug': name of the project's parent category and sub-category (part of a JSON string)
* country: country of the project's creator
* creator -> 'id': ID of the creator, to be used as a categorical variable (part of a JSON string)
* goal: amount of money the project needs to succeed, in the project's local currency
* launched_at: start date of the project (Unix timestamp in seconds)
* deadline: end date of the project (Unix timestamp in seconds)
* spotlight: project highlighted on the website
* staff_pick: marked by a Kickstarter staff member (draws more attention to the project)
* state: (successful/failed/canceled/live/suspended) -> exclude 'live' and combine 'canceled' and 'suspended' with 'failed'
* static_usd_rate: exchange rate used to convert goal from the local currency to USD



### Stakeholder: Project creator 
### Question: Is it worth putting much effort into launching a campaign on Kickstarter?
### Measure: Is the campaign likely to succeed or fail?

## Import Libraries

In [None]:
# Libraries

import os, json, re
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder



## Important Functions

In [None]:
######### functions for pre-processing ####################################################################

def extract_year_date_month(df, column):
    '''Converts a column of Unix timestamps (seconds) to datetime and adds
    new columns with the weekday and month.
    The new columns are named:
        - column_weekday (0 = Monday, ..., 6 = Sunday)
        - column_month
    The year column is currently disabled (see the commented-out line below).
    '''
    
    # Convert column in df to datetime
    df[column] = pd.to_datetime(df[column], unit='s')

    # extract the weekday and month components
    df[column + '_' + 'weekday'] = df[column].dt.weekday
    df[column + '_' + 'month'] = df[column].dt.month
    #df[column + '_' + 'year'] = df[column].dt.year

    return df


def duration(df, column1, column2):
    '''Adds a new column 'duration_days' with the number of whole days between 2 datetime columns
        - column1: start date
        - column2: end date
    '''
    df['duration_days'] = (df[column2] - df[column1]).dt.days

    return df

def convert_to_usd(df):
    return round(df['goal'] * df['static_usd_rate'],2)

######### functions for analysing predictions ########################################################## 



## Load data into one dataframe

In [None]:
directory = 'data-2/'
relevant_columns = ['backers_count', 'category', 'country', 'creator', 'spotlight', 'staff_pick', 'state', 'static_usd_rate', 'goal', 'launched_at', 'deadline']

# Read every file in the directory and keep only the relevant columns
frames = []
for file in sorted(os.listdir(directory)):
    df_temp = pd.read_csv(os.path.join(directory, file))
    frames.append(df_temp[relevant_columns])

# Concatenate once at the end instead of growing the DataFrame inside the loop
data = pd.concat(frames, ignore_index=True)

data.head()

## Work on the json string columns

### Extract the 'slug' parameter from the category column and drop the category column

In [None]:
cat_data = data["category"].apply(json.loads)
cat_data = pd.DataFrame(cat_data.tolist())
data['slug'] = cat_data['slug']
data = data.drop("category", axis=1)
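The column definitions describe `slug` as holding both the parent category and the sub-category. A minimal sketch of separating the two, assuming slugs look like `parent/sub`. The toy values and the `parent_category` / `sub_category` column names are illustrative and not taken from the dataset:

```python
import json
import pandas as pd

# Toy stand-in for the raw 'category' column (illustrative values only)
toy = pd.DataFrame({
    "category": [
        '{"id": 1, "slug": "art/painting"}',
        '{"id": 2, "slug": "music"}',
    ]
})

# Parse each JSON string and pull out the 'slug' field
toy["slug"] = toy["category"].apply(lambda s: json.loads(s)["slug"])

# Split at the first '/' only; slugs without a sub-category get an empty string
parts = toy["slug"].str.partition("/")
toy["parent_category"] = parts[0]
toy["sub_category"] = parts[2]

print(toy[["slug", "parent_category", "sub_category"]])
```

Keeping the parent category as its own feature would reduce the number of distinct levels compared with the full slug, which may or may not help a later model.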

### Extract the ID from the creator column and drop the creator column

In [None]:
data["creator_id"] = data["creator"].apply(lambda x: re.findall(r'\d+', x)[0])
data = data.drop("creator", axis=1)
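Taking the first run of digits gives the creator ID only if the `id` field comes before any other digits in the string. A slightly more defensive sketch, assuming the string contains a literal `"id":` key followed by digits. The sample strings and the `extract_creator_id` helper are hypothetical:

```python
import re
import pandas as pd

# Matches the digits that directly follow an "id" key
ID_PATTERN = re.compile(r'"id"\s*:\s*(\d+)')

def extract_creator_id(raw):
    """Return the creator id as a string, or None if no id key is found."""
    match = ID_PATTERN.search(raw)
    return match.group(1) if match else None

# Illustrative strings, not real records
toy = pd.Series([
    '{"id": 12345, "name": "Studio 54"}',
    '{"name": "Room 101", "id": 678}',
    '{"name": "no id here"}',
])

creator_ids = toy.apply(extract_creator_id)
print(creator_ids.tolist())
```

Rows without a match come back as missing values instead of raising an `IndexError`, so they can be inspected afterwards.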


## Work on the datetime columns

### Convert the date columns to datetime

In [None]:
data['launched_at'] = pd.to_datetime(data['launched_at'], unit='s')
data['deadline'] = pd.to_datetime(data['deadline'], unit='s')

### Extract the weekday and month of the project launch and the campaign duration, then drop the "launched_at" and "deadline" columns

In [None]:
data = extract_year_date_month(data, 'launched_at')
data = duration(data, 'launched_at', 'deadline')

data = data.drop(['launched_at', 'deadline'], axis=1)
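To make the effect of the date steps concrete, here is a small self-contained sketch on made-up timestamps. It repeats the same pandas calls inline rather than calling the notebook's helper functions, and the two projects are invented for illustration:

```python
import pandas as pd

# Two made-up projects with Unix timestamps in seconds (UTC)
toy = pd.DataFrame({
    "launched_at": [1577836800, 1580515200],  # 2020-01-01, 2020-02-01
    "deadline":    [1580428800, 1583020800],  # 2020-01-31, 2020-03-01
})

for col in ["launched_at", "deadline"]:
    toy[col] = pd.to_datetime(toy[col], unit="s")

# weekday: Monday = 0 ... Sunday = 6
toy["launched_at_weekday"] = toy["launched_at"].dt.weekday
toy["launched_at_month"] = toy["launched_at"].dt.month
toy["duration_days"] = (toy["deadline"] - toy["launched_at"]).dt.days

print(toy[["launched_at_weekday", "launched_at_month", "duration_days"]])
```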

### Convert "goal" to USD and drop the "static_usd_rate" and "goal" columns

In [None]:
data['goal_in_usd'] = data.apply(convert_to_usd, axis=1)
data = data.drop(['static_usd_rate', 'goal'], axis=1)
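Row-wise `apply` calls the function once per row, which can be slow on large frames. A vectorized sketch that multiplies the two columns directly. The toy values are invented, and the result may differ from Python's built-in `round` in rare floating-point edge cases:

```python
import pandas as pd

toy = pd.DataFrame({
    "goal": [1000.0, 2500.0, 500.0],
    "static_usd_rate": [1.0, 1.25, 0.5],
})

# Multiply whole columns at once, then round to cents
toy["goal_in_usd"] = (toy["goal"] * toy["static_usd_rate"]).round(2)

print(toy)
```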

In [None]:
data.head(10)

# Data cleaning and exploratory data analysis

In [None]:
# count the missing values per column
data.isna().sum()

In [None]:
data.shape

In [None]:

# count and print the duplicate values per column
def print_duplicate_counts(data):
    """
    Print the total number of duplicate values in each column of the DataFrame.

    Parameters:
    - data: pandas DataFrame
    """
    for column in data.columns:
        duplicate_count = data[column].duplicated().sum()
        print(f"'{column}' has {duplicate_count} duplicate value(s).")

print_duplicate_counts(data)


* Our data does not have null values.
* The duplicates within columns come from categorical data, where repeated values are expected.
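Repeated values within a single column say little about data quality. What usually matters is whether entire rows are repeated. A minimal sketch on toy data (values invented) of how that could be checked:

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["US", "US", "DE", "US"],
    "state": ["successful", "failed", "failed", "successful"],
    "goal_in_usd": [1000.0, 500.0, 500.0, 1000.0],
})

# Count rows that are exact copies of an earlier row
n_duplicate_rows = toy.duplicated().sum()
print("fully duplicated rows:", n_duplicate_rows)

# Keep the first occurrence of each row and drop the rest
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(deduplicated.shape)
```

Whether fully duplicated rows should actually be dropped depends on how the source files were collected, so this is a check to run before deciding, not a fixed cleaning step.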

In [None]:
data.info()

In [None]:
data.describe().round(2)

# Weekday with the most successful projects

In [None]:
# convert launched_at_weekday to day names
import calendar

# Define a function to convert day numbers to day names
def number_to_day_name(day_number):
    return calendar.day_name[day_number]

# Apply the function to create a new column with day names
data['day_name'] = data['launched_at_weekday'].apply(number_to_day_name)

# Display the resulting DataFrame
print(data)


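* Q. If we launch a project on a particular weekday, does it affect the success of the project?
* Ans. Projects launched on Tuesday account for the largest number of successful projects. This is a count of successes rather than a success rate, so it partly reflects how many projects launch on that day.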

In [None]:
# Count the occurrences of each combination of 'state' and 'day_name'
count_data = data.groupby(['state', 'day_name']).size().reset_index(name='count')

# Pivot the data to get 'state' as columns
pivot_data = count_data.pivot(index='day_name', columns='state', values='count').fillna(0)

# Plotting
sns.set(style="whitegrid")  
pivot_data.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.title('Kickstarter Projects by State and Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Number of Projects')
plt.legend(title='State', loc='upper center', bbox_to_anchor=(0.5, 1), fancybox=True, shadow=True)
plt.show()

In [None]:
# Group by 'day_name' and 'state' and count the occurrences
grouped_data = data.groupby(['day_name', 'state']).size().reset_index(name='count')

# Filter only successful projects
successful_projects = grouped_data[grouped_data['state'] == 'successful']

# Find the day with the maximum successful projects
max_successful_day = successful_projects.loc[successful_projects['count'].idxmax()]

# Print the result
print("Day with Maximum Successful Projects:", max_successful_day['day_name'])
print("Number of Successful Projects on that day:", max_successful_day['count'])
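A raw count of successful projects favors weekdays on which many projects launch. To compare weekdays fairly, the share of successful projects per weekday is more informative. A sketch on invented data, where the `is_success` column name is an assumption:

```python
import pandas as pd

toy = pd.DataFrame({
    "day_name": ["Tuesday"] * 4 + ["Sunday"] * 2,
    "state": ["successful", "failed", "failed", "successful",
              "successful", "successful"],
})

# Flag successful projects, then aggregate the flag per weekday
toy["is_success"] = toy["state"].eq("successful")
summary = toy.groupby("day_name")["is_success"].agg(
    n_projects="size", n_successful="sum", success_rate="mean"
)

print(summary.sort_values("success_rate", ascending=False))
```

In this toy example both days have two successful projects, yet their success rates differ because Tuesday has twice as many launches.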

In [None]:
data.head()

In [None]:
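# drop the helper day_name column; the numeric launched_at_weekday column is kept
data1 = data.drop('day_name', axis=1)

# Drop all live projects and encode successful as 1 and all others as 0

In [None]:
# Drop all live projects (.copy() avoids a SettingWithCopyWarning on later assignments)
data1 = data1[data1['state'] != 'live'].copy()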

In [None]:
data1['state'].unique()

In [None]:
data1['state'] = data1['state'].apply(lambda x: 1 if x == 'successful' else 0)


In [None]:
data1['state'].unique()

In [None]:
data1.head()

# Encode the categorical columns as integer labels

In [None]:
print(data1.columns)

In [None]:
# Encode the categorical columns as integer labels (the encoder is refit for each column)
cat_features = ['country', 'spotlight', 'staff_pick', 'slug']
encoder = LabelEncoder()
encoded = data1[cat_features].apply(encoder.fit_transform)

data_cols = ['backers_count', 'state',
       'creator_id', 'launched_at_weekday', 'launched_at_month',
       'duration_days', 'goal_in_usd']
baseline_data = data1[data_cols].join(encoded)
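Reusing one `LabelEncoder` across columns means only the mapping of the last fitted column is retained. If the original labels are needed later, for example to read plots, one fitted encoder per column can be stored. A sketch with toy data, where the `encoders` dictionary is an illustrative name:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "country": ["US", "DE", "US", "GB"],
    "staff_pick": [True, False, False, True],
})

encoders = {}
encoded = pd.DataFrame(index=toy.index)
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc  # keep the fitted encoder for this column

# Map the integer codes back to the original labels
restored_country = encoders["country"].inverse_transform(encoded["country"])
print(encoded)
print(list(restored_country))
```

Note that the integer codes follow the sorted order of the labels and carry no meaningful ranking, which matters for models that treat them as ordered numbers.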

In [None]:
baseline_data.head()

In [None]:
baseline_data.shape

In [None]:
sns.pairplot(baseline_data, hue='state')
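A pairplot over every row of a large frame can take a long time to render. One option is to plot a reproducible random sample of the numeric columns instead. A sketch of that preparation step on invented data; the sample size of 20 and the `plot_data` name are arbitrary choices for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "state": [0, 1] * 50,
    "goal_in_usd": [float(i) for i in range(100)],
    "duration_days": [30] * 100,
    "creator_id": [str(i) for i in range(100)],  # string ids, not numeric
})

# Keep numeric columns only and draw a reproducible random sample
numeric = toy.select_dtypes(include="number")
plot_data = numeric.sample(n=min(len(numeric), 20), random_state=0)

print(plot_data.shape)
# plot_data can then be passed on, e.g. sns.pairplot(plot_data, hue="state")
```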