<font size="12"> **Introduction**

Objectives:
The main objective of this project is to develop a classification system that distinguishes successful from failed projects 
on Kickstarter, based on data crawled from the platform. The system aims to give project creators insights that help them 
set up effective campaign strategies and make informed decisions about launching a crowdfunding project.

<font size="8"> **Data set** 

We began with an extensive dataset comprising X projects seeking crowdfunding on the Kickstarter platform. The data is organized chronologically by month and year. Because of the sheer volume available, processing all of it was impractical, so we focused on the most recent years, from 2020 onwards, and selected one CSV file per month per year. This gave 48 CSV files, covering 48 months across 4 years, amounting to ___ entries.
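The monthly files can be stacked into a single table before any cleaning. The following is a minimal sketch, not the procedure actually used here: the helper name `load_monthly_csvs`, the folder layout, and the extra `source_file` column are all assumptions made for illustration.

```python
from pathlib import Path

import pandas as pd


def load_monthly_csvs(folder, pattern="*.csv"):
    """Read every CSV matching `pattern` in `folder` and stack them into one DataFrame.

    Hypothetical helper: the folder layout and file naming are assumptions.
    """
    paths = sorted(Path(folder).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {folder}")
    frames = []
    for path in paths:
        frame = pd.read_csv(path, low_memory=False)
        # Remember which monthly snapshot each row came from.
        frame["source_file"] = path.name
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```

Keeping the file name on each row makes it possible to check later whether a project appears in more than one monthly snapshot.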

To construct a predictive model, it's essential to establish a structured data lake. This involves organizing the data into a format conducive to analysis and modeling. The data lake should include features such as project category, funding goal, campaign duration, project description, creator background, funding success/failure, and any other relevant variables. Each entry should be accurately labeled to facilitate supervised learning.

However, several challenges and potential biases may arise during this process:

- Sampling Bias: By focusing solely on recent years, there may be a bias towards contemporary trends and project characteristics. Older projects, which could offer valuable historical insights, may be underrepresented or excluded entirely.

- Selection Bias: The decision to include only one CSV per month per year may inadvertently prioritize certain types of projects or time periods, leading to a biased sample.

- Imbalanced Classes: The dataset may exhibit an imbalance between successful and failed projects, with one class significantly outnumbering the other. This can skew the predictive model's performance and accuracy.

- Missing Data: Some entries may contain missing or incomplete information, which can hinder the effectiveness of the predictive model if not addressed appropriately.

- Feature Engineering: Identifying and extracting relevant features from the raw data requires careful consideration and domain expertise. It's crucial to select features that have predictive power while avoiding those that introduce noise or multicollinearity.

Addressing these challenges will require thorough data preprocessing, feature engineering, and model validation to mitigate biases and ensure the model's generalizability and effectiveness.
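As a first check on the missing-data concern listed above, the share of missing values per column can be computed directly. This is a rough sketch on a made-up toy frame: the helper name `missing_share` and the 0.5 cut-off are assumptions, not choices taken from this analysis.

```python
import pandas as pd


def missing_share(df):
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "friends": [None, None, None, "[]"],
    "blurb": ["a", "b", None, "d"],
    "id": [1, 2, 3, 4],
})

share = missing_share(toy)
# 0.5 is an arbitrary threshold chosen only for this example.
sparse_columns = share[share > 0.5].index.tolist()
print(share)
print("Mostly empty columns:", sparse_columns)
```

Columns flagged this way are candidates for removal, but the final decision still depends on whether the column carries information the model needs.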

In [30]:
import pandas as pd
df = pd.read_csv(r'1. uncleaned_4years.csv', low_memory=False)
pd.options.display.float_format = '{:.2f}'.format


In [31]:
#first remove any duplicate rows, so that data exploration and modelling work on the same data
df = df.drop_duplicates(ignore_index=True)

#understanding the size of the df
print("The dataset has",len(df),"rows")
print("They are divided into columns and rows:",df.shape)

#which columns it includes and their data types (df.info() prints directly and returns None)
df.info()


The dataset has 146925 rows
They are divided into columns and rows: (146925, 48)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146925 entries, 0 to 146924
Data columns (total 48 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   friends                   111 non-null     object 
 1   state_changed_at          146925 non-null  int64  
 2   blurb                     146914 non-null  object 
 3   id                        146925 non-null  int64  
 4   static_usd_rate           146925 non-null  float64
 5   permissions               111 non-null     object 
 6   location                  146785 non-null  object 
 7   backers_count             146925 non-null  int64  
 8   deadline                  146925 non-null  int64  
 9   source_url                146925 non-null  object 
 10  usd_type                  146851 non-null  object 
 11  photo                     146925 non-null  object 
 12  is_starred         

Some columns add no value to the analysis and only increase the size of the dataset, making it harder to process, while others have very little data available. We drop both kinds.

In [32]:
#drop columns that carry no useful information or are almost entirely empty
df = df.drop(columns=[
    'friends', 'permissions', 'is_starred', 'is_backing', 'video',
    'is_launched', 'unseen_activity_count', 'is_disliked', 'last_update_published_at',
    'unread_messages_count', 'percent_funded', 'is_liked', 'prelaunch_activated',
    'currency_trailing_code', 'urls', 'location', 'currency_symbol', 'source_url', 'country',
])

<font size="3"> The variables that seem most interesting at this point are Category, Country, State and Currency, all of object type, 
and Converted Pledged Amount, which is comparable across the several currencies in the data. It would also be interesting to look at goal, but because it is expressed in different currencies it does not provide a valid insight as it stands. 
The State variable is our target variable. 
We explore these variables further:

In [33]:
df[['category','country_displayable_name','state','creator','currency']].describe(include=object)

Unnamed: 0,category,country_displayable_name,state,creator,currency
count,146925,146925,146925,146925,146925
unique,354,25,7,146715,15
top,"{""id"":253,""name"":""Webcomics"",""analytics_name"":...",the United States,successful,"{""id"":2118747970,""name"":""Gladys"",""slug"":""gmutu...",USD
freq,5621,99284,89622,5,99284
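The `category` and `creator` cells shown above hold JSON strings rather than plain labels, which is why the most frequent value is an entire object. A minimal sketch for pulling out the readable `name` field follows. The helper `extract_name` is hypothetical, the sample string is a shortened made-up cell, and only the `name` key visible in the output is assumed to exist.

```python
import json


def extract_name(raw):
    """Return the "name" field of a JSON-encoded cell, or None if it cannot be parsed."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        # Missing values (None, NaN) and malformed strings end up here.
        return None
    return parsed.get("name") if isinstance(parsed, dict) else None


# Shortened, made-up example of a cell (not a real row from the dataset).
sample = '{"id": 253, "name": "Webcomics"}'
print(extract_name(sample))
```

On the real frame this could be applied with `df['category'].map(extract_name)`, assuming the cells are valid JSON once the CSV has been read.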


In [8]:
df[['converted_pledged_amount']].describe()

Unnamed: 0,converted_pledged_amount
count,146014.0
mean,16261.76
std,157111.25
min,0.0
25%,200.0
50%,2109.0
75%,8292.25
max,41754153.0


<font size="4"> The state variable will be our target variable, and we noticed that it takes 7 distinct values.
We also noticed that there are 15 different currencies, with over 67% of projects in USD. Nevertheless, we will always have to use converted amounts in our analysis, to avoid wrong conclusions caused by mixing currencies.
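One way to put amounts on a common scale is to multiply them by the `static_usd_rate` column. The sketch below uses made-up numbers and rests on an assumption that has not been verified here: that `static_usd_rate` converts one unit of the project's currency into USD. The `goal_usd` column name is introduced only for illustration.

```python
import pandas as pd

# Toy rows; the numbers are made up for illustration only.
toy = pd.DataFrame({
    "currency": ["USD", "EUR", "GBP"],
    "goal": [1000.0, 1000.0, 1000.0],
    "static_usd_rate": [1.0, 1.1, 1.25],
})

# Assumption: static_usd_rate converts one unit of `currency` into USD.
toy["goal_usd"] = toy["goal"] * toy["static_usd_rate"]

# Share of projects denominated in USD.
usd_share = (toy["currency"] == "USD").mean()
print(toy[["currency", "goal", "goal_usd"]])
print(f"Share of USD projects: {usd_share:.1%}")
```

If the assumption about the rate holds, a converted goal would be directly comparable with `converted_pledged_amount`.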

<font size="6">**Target Variable**

<font size="4">Understanding the target variable is crucial because it defines the problem we are trying to solve: what exactly the model should predict or classify. This understanding guides the whole modeling process, from selecting features to choosing an algorithm.

<font size="4">Checking whether we have sufficient data on the target variable is equally important, because it directly affects the performance and reliability of the model. We need an adequate number of samples for each class of the target variable. Without them, the model may produce biased predictions and generalize poorly to new, unseen projects. Assessing data sufficiency therefore tells us whether additional data collection or a sampling strategy is needed.

In [34]:
# list the distinct values of the target variable
print(df['state'].unique())
# keep only projects that ended as failed or successful
df = df[df['state'].isin(['failed', 'successful'])]

Project states other than successful and failed are not relevant here, as projects in the remaining states have not yet been through the whole process.
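After filtering, the balance between the two remaining classes can be checked with `value_counts`. This is a small sketch on a made-up toy column, in which the label `other` is a placeholder for any state that is neither failed nor successful.

```python
import pandas as pd

# Toy target column; the values are made up and "other" is a placeholder label.
toy = pd.DataFrame(
    {"state": ["successful", "failed", "successful", "other", "other", "successful"]}
)

# Keep only the two outcomes of interest, then count them.
finished = toy[toy["state"].isin(["failed", "successful"])]
counts = finished["state"].value_counts()
shares = finished["state"].value_counts(normalize=True)
print(counts)
print(shares.round(2))
```

A strongly uneven split would suggest looking at stratified sampling or class weights before modelling.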

In [35]:
# saving df in csv 
df.to_csv('2. visualization_base.csv', index=False)