# Introduction to Machine Learning Workshop
## Data Science Society at Berkeley

* __Date__: April 18, 2018

* __Author__: Alex Nakagawa

* _Last Updated_: April 18, 2018

* __License__: Feel free to use this notebook in any way that you would like.

## Introduction

Welcome to the Introduction to Machine Learning Workshop! Thank you for taking the time to learn about one of the most exciting subjects in industry today: machine learning. We are here to guide you through the basics of how a machine learning project is structured, using several case studies.

The purpose of this Jupyter Notebook is to give you a guided walkthrough of the case studies we will go over in class, so that you can understand every detail of each one. Let's begin!

## Structure

We've developed a general but somewhat rigid structure to follow whenever you build a model in a data science project:

| Step | Name |
| --- | --- |
| 1 | Preprocessing |
| 2 | EDA |
| 3 | Model Creation |
| 4 | Evaluation |
| 5 | Feature Engineering |



## Legend

If you see this arrow, it will ask you to fill in code or text to answer a question: <img src='down_arrow.png' style='width:50px;height:50px;'></img>

## Part 0: Packages

The following packages are STANDARD for any data science project, and you will use them almost every time. __Run the next block__.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Optional
# %matplotlib inline

Why do we need these specific packages? If you don't recognize a package, write about its benefits below.

<img src='down_arrow.png' style='width:50px;height:50px;'></img>

`YOUR ANSWER HERE`

## Part I: Simple Linear Regression

__Linear regression__ is a predictive modeling technique for predicting a numeric response variable from one or more features.  
"Linear" in the name refers to the fact that the method fits a model in which the response has a linear relationship with the features (i.e., Z depends on the first power of each X).

__Z = X0 + a(X1) + b(X2) + ... where:__   
Z: predicted response  
X0: intercept  
a, b, ...: coefficients of X1, X2, ...  

If Y is the actual response and Z is the predicted response,    
__Y - Z = Residual__  
The size of the residuals measures model performance (the code below reports the mean squared error); residuals that are all equal to zero represent a perfect fit.
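
To make the residual concrete, here is a minimal sketch that computes residuals and their mean square by hand. The arrays `Y` and `Z` hold made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical actual responses Y and predictions Z (illustrative values only).
Y = np.array([3.0, 5.0, 7.0, 9.0])
Z = np.array([2.5, 5.5, 7.0, 10.0])

residuals = Y - Z                 # one residual per observation
mean_residual = residuals.mean()  # positive and negative errors can cancel out
mse = np.mean(residuals ** 2)     # squaring prevents that cancellation

print("Residuals:", residuals)
print("Mean residual:", mean_residual)
print("Mean squared error:", mse)
```

Because errors of opposite sign cancel in a plain average, the squared version is usually the more informative summary.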

In [None]:
'''Source: Scikit learn
Code source: Jaques Grobler
License: BSD 3 clause'''
from sklearn.linear_model import LinearRegression

example_dff = pd.DataFrame(np.random.randint(0,100,size=(100, 1)),columns=['X'])
example_dff['C']=5.1*example_dff['X']
# example_dff['C']=5.1*example_dff['X']**2
X_reg = example_dff[['X']]

Y_reg = example_dff['C']

# Create linear regression object
model = LinearRegression()

# Train the model using the training sets
model.fit(X_reg, Y_reg)
Z_reg=model.predict(X_reg)

# The coefficients
print('Coefficients:', model.coef_)
# The mean squared error
print("Mean squared error:",np.mean((Z_reg - Y_reg) ** 2))

# Plot outputs
plt.scatter(X_reg['X'], Y_reg,  color='red')
plt.plot(X_reg['X'], Z_reg, color='blue',
         linewidth=3)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Linear Regression using data with one feature -X')
plt.xticks(())
plt.yticks(())

plt.show()
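
The data above is exactly linear, so the fit is essentially perfect. As a rough sketch of what changes with noisier data, the block below adds an intercept and Gaussian noise to made-up data and fits a line with `np.polyfit`, used here as a stand-in for `LinearRegression`. The slope, intercept, noise level, and seed are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the run is repeatable

# Hypothetical data: y = 5.1 * x + 2 plus Gaussian noise (values chosen for illustration).
x = rng.integers(0, 100, size=100).astype(float)
y = 5.1 * x + 2.0 + rng.normal(0.0, 10.0, size=100)

# Least-squares fit of a degree-1 polynomial: returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)
z = slope * x + intercept

print("Slope:", slope)
print("Intercept:", intercept)
print("Mean squared error:", np.mean((z - y) ** 2))
```

The recovered slope should land close to 5.1, but the mean squared error no longer drops to zero, because no straight line can pass through every noisy point.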

## Part II: Simple Classification

This case study will go over the structure of creating a simple classification model.

We will first look at Kickstarter data found on Kaggle [here](https://www.kaggle.com/kemical/kickstarter-projects). Before we do ANYTHING, what are we even working with? READ any available documentation for a dataset you find publicly online. Why do we want to do this? <img src='down_arrow.png' style='width:50px;height:50px;'></img>

`YOUR ANSWER HERE`

### Section 1. Preprocessing

Let's take a look at what our data actually looks like now. Several things we want to check for:

* Are there missing values?
* Are the columns the dtypes that we want?
* What exactly does each row represent?

More details about what to look for from a public data set are here: https://www.textbook.ds100.org/ch01/the_students_of_ds100_1.html
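
As a small illustration of the first two checks, the sketch below builds a made-up table (the column names only imitate a crowdfunding dataset) and inspects its missing values and dtypes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; the values are invented for illustration.
toy = pd.DataFrame({
    "name": ["Project A", "Project B", None, "Project D"],
    "goal": ["1000", "2500", "500", "80000"],      # numbers stored as text
    "pledged": [1200.0, np.nan, 40.0, 100.0],
})

print(toy.isnull().sum())   # missing values per column
print(toy.dtypes)           # 'goal' is not a numeric dtype yet

# Convert the text column to a numeric dtype before doing arithmetic on it.
toy["goal"] = pd.to_numeric(toy["goal"])
print(toy.dtypes)
```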

We'll import the data using the `pandas.read_csv` function.

In [None]:
kickstarter = pd.read_csv('kickstarter.csv')
kickstarter.head()

See anything interesting? Probably not from just looking at the first 5 rows. Let's continue to familiarize ourselves with this dataset by summarizing it. Luckily, `pandas` makes that pretty easy.

In [None]:
kickstarter.describe()

In [None]:
kickstarter.isnull().sum()

In [None]:
kickstarter.info()

What do you think are some problems we may run into looking at these three function calls?

<img src='down_arrow.png' style='width:50px;height:50px;'></img>

`YOUR ANSWER HERE`

Wow, that's a lot of missing values for `usd pledged`... let's see what those rows look like.

In [None]:
null_rows = kickstarter.isnull().any(axis=1)
kickstarter[null_rows]

How should we deal with these missing values? 

<img src='down_arrow.png' style='width:50px;height:50px;'></img>

`YOUR ANSWER HERE`

There are a ton of rows, over 300,000 of them! It'd be OK to drop the roughly 3,700 that have missing values.

In [None]:
print('Before: ', kickstarter.shape)
kickstarter = kickstarter.dropna(axis=0, how='any')
print('After: ', kickstarter.shape)
kickstarter.isnull().sum()
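
Dropping rows is only one option. As a hedged sketch on a made-up frame, the block below compares `dropna` with `fillna`; the median and the `"unknown"` label are arbitrary fill choices for illustration, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each of two rows.
toy = pd.DataFrame({
    "backers": [10, 3, 250, 0],
    "pledged": [500.0, np.nan, 9000.0, 0.0],
    "country": ["US", "GB", None, "US"],
})

dropped = toy.dropna(axis=0, how="any")              # remove rows with any gap
filled = toy.fillna({"pledged": toy["pledged"].median(),
                     "country": "unknown"})          # keep rows, fill the gaps

print("Original rows:", len(toy))
print("After dropna:", len(dropped))
print("After fillna:", len(filled), "| remaining nulls:", int(filled.isnull().sum().sum()))
```

Dropping is simplest when the missing rows are a small share of the data; filling keeps every row but injects values that were never observed.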

### Section 2. EDA (Exploratory Data Analysis)

Our data is now in good shape to begin exploring statistically.

Now that we've handled the missing values, let's further our understanding by finding the different categories in our columns. We noticed that there's a `main_category` variable, so let's see what values it takes.

In [None]:
kickstarter['main_category'].unique()

OR

In [None]:
count_categories = kickstarter.groupby('main_category').size()
count_categories
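
A third way to get the same counts is `value_counts`, which sorts by frequency instead of by label. A minimal sketch on made-up category labels:

```python
import pandas as pd

# Hypothetical category labels, for illustration only.
toy = pd.DataFrame({"main_category": ["Music", "Games", "Music", "Art", "Music", "Games"]})

by_group = toy.groupby("main_category").size()   # sorted by category name
by_count = toy["main_category"].value_counts()   # sorted by frequency, largest first

print(by_group)
print(by_count)
```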

In [None]:
# Pass x and y by keyword; recent seaborn versions reject positional data arguments
g = sns.barplot(x=count_categories.index, y=count_categories.values)
for item in g.get_xticklabels():
    item.set_rotation(90)

Okay... interesting. Perhaps I want to see the current state of each project. Let's find a way to visualize that.

In [None]:
states = kickstarter.groupby('state').size()
states

In [None]:
states.plot(kind='bar')
plt.title('Counts for States of All Kickstarter Campaigns');

This visualization is _okay_, but how can we improve it?  <img src='down_arrow.png' style='width:50px;height:50px;'></img>

`YOUR ANSWER HERE`

In [None]:
states_log = np.log(states)
states_log.plot(kind='bar')
plt.title('Log-Counts for States of All Kickstarter Campaigns');
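
To see why the log helps, here is a rough sketch with made-up counts that span several orders of magnitude. The numbers are assumptions for illustration, not the real totals.

```python
import numpy as np
import pandas as pd

# Hypothetical state counts spanning several orders of magnitude.
counts = pd.Series({"failed": 190000, "successful": 130000,
                    "canceled": 38000, "suspended": 1800})

log_counts = np.log(counts)   # natural log, the same transform as np.log(states)

raw_ratio = counts.max() / counts.min()           # tallest bar vs. shortest bar, raw scale
log_ratio = log_counts.max() / log_counts.min()   # same comparison after the log

print("Raw ratio:", raw_ratio)
print("Log ratio:", log_ratio)
```

After the transform the smallest bar is no longer dwarfed by the largest, but the bar heights are no longer counts. Calling `plt.yscale('log')` on the original plot is an alternative that keeps the axis labeled in counts.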

Alright, we're now going to take the projects whose state is failed or successful and use them to formulate some kind of question.  <img src='down_arrow.png' style='width:50px;height:50px;'></img>

In [None]:
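# Keep only finished campaigns; .copy() lets us add columns later without a SettingWithCopyWarning
kickstarter_failed_successful = kickstarter[(kickstarter['state']=='successful') |
                                            (kickstarter['state']=='failed')].copy()
kickstarter_failed_successful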

In [None]:
kickstarter_failed_successful['difference'] = kickstarter_failed_successful['usd_pledged_real'] - \
                                                 kickstarter_failed_successful['usd_goal_real']
kickstarter_failed_successful[['difference']]

In [None]:
kickstarter_failed_successful.sort_values(by='difference', ascending=False)

Here's a general question: what kinds of factors go into a successful or failed Kickstarter project? Could it be that the number of backers has an impact on whether people want to back a certain project? Or could it be that one country has more projects than another, which would generate more interest? Give an example of another column that could lead to a "successful" state for a project (besides the fact that it reaches its goal). <img src='down_arrow.png' style='width:50px;height:50px;'></img>

`YOUR ANSWER HERE`

Hoorah! Now we know that we want to do a classification on the dataset. Some of the columns are not numeric, so let's fix that now.

In [None]:
kickstarter_failed_successful = kickstarter_failed_successful.drop(['name', 
                                                                    'category', 
                                                                    'deadline', 
                                                                    'launched'], axis=1)

In [None]:
categories_list = count_categories.index.tolist()

kickstarter_failed_successful['main_category'] = kickstarter_failed_successful['main_category'].replace( 
    categories_list,
    np.arange(len(categories_list))
)

kickstarter_failed_successful['currency'] = kickstarter_failed_successful['currency'].replace( 
    kickstarter_failed_successful['currency'].unique(),
    np.arange(len(kickstarter_failed_successful['currency'].unique()))
)

kickstarter_failed_successful['country'] = kickstarter_failed_successful['country'].replace( 
    kickstarter_failed_successful['country'].unique(),
    np.arange(len(kickstarter_failed_successful['country'].unique()))
)

kickstarter_failed_successful['state'] = kickstarter_failed_successful['state'].replace( 
    ['successful', 'failed'],
    [1, 0]
)
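
The `replace` calls above map each label to an integer. As a sketch of a more compact route to the same kind of encoding, `pd.factorize` does it in one call and also returns the labels. The toy column below is made up for illustration.

```python
import pandas as pd

# Hypothetical currency column, for illustration only.
toy = pd.DataFrame({"currency": ["USD", "GBP", "USD", "EUR", "GBP"]})

# factorize assigns integer codes in order of first appearance
# and returns the labels so the mapping can be reversed later.
codes, labels = pd.factorize(toy["currency"])
toy["currency_code"] = codes

print(toy)
print(list(labels))   # position in this list is the integer code
```

Keep in mind that integer codes suggest an ordering (EUR > GBP > USD here) that does not really exist. One-hot encoding, for example with `pd.get_dummies`, is a common alternative when that matters for the model.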


In [None]:
print("Kickstarter dataframe shape (just failed, successful, and dropped cols): " , kickstarter_failed_successful.shape)
kickstarter_failed_successful

There's one final worry in our data... but it may not be so obvious. What is it? <img src='down_arrow.png' style='width:50px;height:50px;'></img>

`YOUR ANSWER HERE`

### Section 3. Model

In [None]:
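# Work with a random subset to keep training fast; random_state makes the sample repeatable
df = kickstarter_failed_successful.sample(5000, random_state=7)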
df

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(df.drop('state', axis=1),
                                                    df['state'], test_size=0.2,
                                                    random_state=7)

# Creating model
model = SVC()

# Fitting model on the training split
model.fit(X_train, y_train)
# Predicting on the held-out test split with the fitted model
predicted = model.predict(X_test)
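
The usual pattern with a train/test split is to fit on the training rows and predict on the held-out rows. Below is a minimal sketch of that pattern on synthetic data. The two clusters, the seed, and the name `clf` are assumptions for illustration and have nothing to do with the Kickstarter data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(7)

# Hypothetical, well-separated synthetic data: two clusters of 100 points each.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(10.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7)

clf = SVC()
clf.fit(X_train, y_train)          # learn only from the training rows
predicted = clf.predict(X_test)    # score on rows the model has never seen

print("Held-out accuracy:", np.mean(predicted == y_test))
```

By contrast, `cross_val_predict` clones the estimator and refits it on folds of whatever data it is given, so it does not reuse a model that was already fitted.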

### Section 4. Model Evaluation + 5. Feature Engineering (Optional)

In [None]:
# Scoring
metrics.accuracy_score(y_test, predicted)
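
Accuracy is the fraction of predictions that match the true labels. To judge whether a score is good, it helps to compare it against a trivial baseline that always predicts the most common class. A small sketch with made-up labels:

```python
import numpy as np

# Hypothetical labels and predictions (1 = successful, 0 = failed).
y_true = np.array([1, 0, 0, 1, 0, 0, 1, 0])
y_pred = np.array([0, 0, 0, 1, 0, 0, 0, 0])

accuracy = np.mean(y_true == y_pred)          # fraction of correct predictions

# Baseline: always predict the most common class in y_true.
majority = np.bincount(y_true).argmax()
baseline = np.mean(y_true == majority)

print("Accuracy:", accuracy)
print("Majority-class baseline:", baseline)
```

A model that barely beats the baseline has learned little beyond the class balance.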

How was the model evaluation score? Not so great, right? Let's think of ways to improve: <img src='down_arrow.png' style='width:50px;height:50px;'></img>

`YOUR ANSWER HERE`



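The following process is called normalization: each numeric column is centered on its mean and divided by its range (max minus min), so that all features end up on a comparable scale.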
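
As a quick sketch of that formula on a single made-up column (the helper name `mean_normalize` and the values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical goal amounts with one very large value.
toy = pd.DataFrame({"goal": [100.0, 500.0, 1000.0, 100000.0]})

def mean_normalize(col):
    """Center a column on its mean and divide by its range (max - min)."""
    return (col - col.mean()) / (col.max() - col.min())

toy["goal_scaled"] = mean_normalize(toy["goal"])
print(toy)
```

After scaling, the column has mean zero and spans a range of exactly one, so no single feature dominates purely because of its units.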

In [None]:
# Mean-normalize each numeric column: subtract the mean, divide by the range
numeric_cols = ['goal', 'pledged', 'backers', 'usd pledged',
                'usd_pledged_real', 'usd_goal_real', 'difference']
for col in numeric_cols:
    df[col] = (df[col] - df[col].mean()) / (df[col].max() - df[col].min())

X_train, X_test, y_train, y_test = train_test_split(df.drop('state', axis=1),
                                                    df['state'], test_size=0.2,
                                                    random_state=7)

# Creating model
model = SVC()

# Fitting model on the training split
model.fit(X_train, y_train)
# Predicting on the held-out test split and scoring
predicted = model.predict(X_test)
metrics.accuracy_score(y_test, predicted)