# Classification and Model Selection

## Classifying Kickstarter Campaigns

Kickstarter is a crowdfunding platform with a community of more than 10 million people, made up of creative and tech enthusiasts who help bring new projects to life.

To date, members have contributed more than $3 billion to fund creative projects.
A project can be almost anything: a device, a game, an app, a film, and so on.

Kickstarter works on an all-or-nothing basis: a campaign is launched with a funding goal, and if it does not meet that goal, the project owner gets nothing. For example, if a project's goal is $\$5000$ and it receives $\$4999$ in funding, the project is not a success.

If you have a project that you would like to post on Kickstarter now, can you predict whether it will be successfully funded or not? Looking into the dataset, what useful information can you extract from it? Which variables are informative for your prediction, and can you interpret the model?

The goal of this project is to build a classifier to predict whether a project will be successfully funded or not. 

**💡 You can use any algorithm of your choice.**

We will use `sklearn` and the usual data science libraries such as `pandas` and `numpy`.

KATE expects your code to define variables with specific names that correspond to certain things we are interested in.

KATE will run your notebook from top to bottom and check the latest value of those variables, so make sure you don't overwrite them.

* Remember to uncomment the line assigning the variable to your answer and don't change the variable or function names.
* Use copies of the original or previous DataFrames to make sure you do not overwrite them by mistake.

You will find instructions below about how to define each variable.

Once you're happy with your code, upload your notebook to KATE to check your feedback.

### Baseline Model

In this exercise, we aim to outperform a simple `baseline` model: a logistic regression with only two features, `goal_usd` (the goal converted to USD) and `usa` (whether the campaign was based in the US).

The code to build this `baseline` is shown below:

```Python
from sklearn.linear_model import LogisticRegression

# Conduct some custom processing on your training data
df["usa"] = df["country"] == "US"
df["goal_usd"] = df["goal"] * df["static_usd_rate"]

df = df[["goal_usd", "usa", "state"]]

# Conduct the same processing on your testing data
df_eval["usa"] = df_eval["country"] == "US"
df_eval["goal_usd"] = df_eval["goal"] * df_eval["static_usd_rate"]

df_eval = df_eval[["goal_usd", "usa", "state"]]

X = df.drop(["state"], axis=1)
y = df["state"]

X_eval = df_eval.drop(["state"], axis=1)

model = LogisticRegression()
model.fit(X, y)

y_pred = model.predict(X_eval)
```

### Our Model

To kick things off, let's import and use our favourite data processing library, `pandas`, to retrieve the data that we will use to build a machine learning model.

In this assignment, we are going to load in two datasets. The first, `df`, is going to contain all the data we will need to train and test a model. This will include the labels indicating whether or not the project was successfully funded. The second dataset, `df_eval`, is going to contain all the data that our model will be evaluated on by **KATE**. It does not include the labels indicating project success, so can be viewed as held-out test data. 

We will need to process `df_eval` in exactly the same way as `df`, then use our model trained on `df` to make predictions about `df_eval`. On submission, **KATE** will evaluate these predictions against their labels (which **KATE** has access to).


Run the cell below to load the raw data. Note that `pandas` can read these gzip-compressed files directly into regular `DataFrames`:

In [1]:
import pandas as pd

df = pd.read_csv("data/kickstarter.gz")
df_eval = pd.read_csv("data/kickstarter_eval.gz")

print(df.shape)
print(df_eval.shape)

(50000, 26)
(10000, 26)


We have also displayed the dimensions of `df` and `df_eval`. Notice that `df_eval` has only $10,000$ rows. As mentioned earlier, this will be our test set for submissions to **KATE**.

**This means that we cannot train on `df_eval`**


The aim of this practical is to:
  * Process `df` into an input dataframe `X` and a label dataframe `y`
  * Process `df_eval` into an input dataframe `X_eval` (in the same way as we processed `df` into `X`)
  * Train a classification model of our choice on `X` and `y`
  * Submit our code to KATE, where our model's predictions for `X_eval` will be scored against the held-out labels (`y_eval`), which only KATE has access to

Let's kick things off by checking out our `df`. Note that the `state` column contains our success labels:

In [2]:
df.head()

Unnamed: 0,id,photo,name,blurb,goal,slug,disable_communication,country,currency,currency_symbol,...,location,category,profile,urls,source_url,friends,is_starred,is_backing,permissions,state
0,805910621,"{""small"":""https://ksr-ugc.imgix.net/assets/012...",DOCUMENTARY FILM titled FROM RAGS TO SPIRITUAL...,A MOVIE ABOUT THE WILLINGNESS TO BREAK FREE FR...,125000.0,movie-made-from-book-titled-from-rags-to-spiri...,False,US,USD,$,...,"{""country"":""US"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""should_show_f...","{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...,,,,,0.0
1,1279627995,"{""small"":""https://ksr-ugc.imgix.net/assets/011...","American Politics, Policy, Power and Profit",Everything you should know about really big go...,9800.0,american-politics-policy-power-and-profit,False,US,USD,$,...,"{""country"":""US"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""should_show_f...","{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...,,,,,0.0
2,1306016155,"{""small"":""https://ksr-ugc.imgix.net/assets/013...","Drew Jacobs Official ""Kiss Me"" Music Video","Be a part of the new ""Kiss Me"" Official Music ...",2500.0,drew-jacobs-official-kiss-me-music-video,False,US,USD,$,...,"{""country"":""US"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""should_show_f...","{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...,,,,,1.0
3,658851276,"{""small"":""https://ksr-ugc.imgix.net/assets/011...",Still Loved,When their dreams are shattered by the loss of...,10000.0,still-loved,False,GB,GBP,£,...,"{""country"":""GB"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""should_show_f...","{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...,,,,,1.0
4,1971770539,"{""small"":""https://ksr-ugc.imgix.net/assets/012...",Nine Blackmon's HATER Film Project,HATER is a mock rock doc about why the Rucker ...,5500.0,nine-blackmons-hater-film-project,False,US,USD,$,...,"{""country"":""US"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""should_show_f...","{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...,,,,,0.0


In addition to using the `.head()` function, let's also retrieve some more information about our data:

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 26 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      50000 non-null  int64  
 1   photo                   50000 non-null  object 
 2   name                    49998 non-null  object 
 3   blurb                   49998 non-null  object 
 4   goal                    50000 non-null  float64
 5   slug                    50000 non-null  object 
 6   disable_communication   50000 non-null  bool   
 7   country                 50000 non-null  object 
 8   currency                50000 non-null  object 
 9   currency_symbol         50000 non-null  object 
 10  currency_trailing_code  50000 non-null  bool   
 11  deadline                50000 non-null  int64  
 12  created_at              50000 non-null  int64  
 13  launched_at             50000 non-null  int64  
 14  static_usd_rate         50000 non-null

From the above, we can see that there are $50,000$ projects in `df` and, apart from a handful of columns, most of our data is not null - what a relief!

In [4]:
df_eval.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 26 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      10000 non-null  int64  
 1   photo                   10000 non-null  object 
 2   name                    10000 non-null  object 
 3   blurb                   10000 non-null  object 
 4   goal                    10000 non-null  float64
 5   slug                    10000 non-null  object 
 6   disable_communication   10000 non-null  bool   
 7   country                 10000 non-null  object 
 8   currency                10000 non-null  object 
 9   currency_symbol         10000 non-null  object 
 10  currency_trailing_code  10000 non-null  bool   
 11  deadline                10000 non-null  int64  
 12  created_at              10000 non-null  int64  
 13  launched_at             10000 non-null  int64  
 14  static_usd_rate         10000 non-null 

Unlike `df`, `df_eval` contains only $10,000$ projects. Also note that the target variable, `state`, is null for all entries; as mentioned earlier, the true labels are stored on **KATE** for evaluating our model when we submit our code.

In [5]:
print("df labels:      ", df.state.unique())
print("df_eval labels: ", df_eval.state.unique())

df labels:       [0. 1.]
df_eval labels:  [nan]


**Notes on the dataset**:
* The target `state` corresponds to a binary outcome: `0` for failed, `1` for successful. 
* The variables `'deadline'`, `'created_at'`, `'launched_at'` are stored in Unix time format.
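
Unix timestamps count seconds since 1 January 1970 (UTC), so they can be converted to dates or subtracted directly. As a minimal sketch, assuming made-up timestamp values in a toy DataFrame (not rows from the real dataset), the conversion and a campaign duration in days might look like this:

```python
import pandas as pd

# Toy rows standing in for the real data; the timestamps are made-up values
toy = pd.DataFrame({
    "launched_at": [1_400_000_000, 1_420_000_000],
    "deadline": [1_402_592_000, 1_425_184_000],
})

# unit="s" tells pandas the integers are seconds since 1970-01-01 UTC
launched = pd.to_datetime(toy["launched_at"], unit="s")
deadline = pd.to_datetime(toy["deadline"], unit="s")

# Subtracting two datetime columns gives a Timedelta; convert it to days
toy["duration_days"] = (deadline - launched).dt.total_seconds() / 86_400

print(toy["duration_days"].tolist())
```

Whether a duration feature actually helps the classifier is something to check empirically.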

## Part 1: Preprocessing the data

Although our data is relatively clean, it is not yet in a state where we can train a model. For instance, both `df` and `df_eval` contain the columns used for training (features) as well as the target column, although the target is, of course, null for `df_eval`.

What we have to do now is preprocess our data. Specifically, we need to:
 - Build a training set: `X` and `y`
 - Build an evaluation set: `X_eval`
 
 <br>
 
Let's start by extracting the `state` column from `df` into a variable called `y`:
 - Create a new variable `y` from `df["state"]`
 - Drop the `state` column from `df` and `df_eval`

In [6]:
y = df["state"]

df.drop("state", axis=1, inplace=True)
df_eval.drop("state", axis=1, inplace=True)

### 1.1 Remove redundant columns

After removing the target variable `y` from our input data, we can start processing.

Remove all the columns that **you** think are not salient for this classification task. For instance, `id`, `photo`, `slug`, and `disable_communication` are some features which are not likely to be relevant. The choice of which features to retain, however, is yours to make. Remember to remove the same columns from both `df` and `df_eval`.

You can use the `.drop()` method from `pandas` to remove columns (remember to specify `axis=1`, or pass the names via `columns=`).

In [7]:
# Your code here:
columns_to_drop = ['id','photo','slug','disable_communication', 'friends','is_starred','is_backing','permissions', 'location','profile','urls','source_url','creator','currency_symbol','name']

# columns= already selects the column axis, so axis=1 is not needed
df = df.drop(columns=columns_to_drop)
df_eval = df_eval.drop(columns=columns_to_drop)
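
To illustrate how `.drop()` behaves, here is a small sketch on toy DataFrames (the names `train`, `held_out` and `to_drop` are hypothetical, chosen for this example only). It shows that `.drop()` returns a new DataFrame and that reusing one list keeps both frames consistent:

```python
import pandas as pd

# Toy frames standing in for df and df_eval
train = pd.DataFrame({"id": [1, 2], "goal": [100.0, 250.0], "state": [0.0, 1.0]})
held_out = pd.DataFrame({"id": [3], "goal": [75.0], "state": [float("nan")]})

# One shared list guarantees that both frames lose the same columns
to_drop = ["id"]

# drop(columns=...) returns a new DataFrame and leaves the original untouched
train_small = train.drop(columns=to_drop)
held_out_small = held_out.drop(columns=to_drop)

print(list(train_small.columns))
print(list(train_small.columns) == list(held_out_small.columns))
```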

### 1.2 Fill null values

Looking at the output of `df.info()` above, we can see that some of the columns which we might be interested in as features contain some null values. Null values are, in general, a problem for machine learning models and can cause your code to break. How you choose to deal with them, however, will depend in large part on how you intend to process your data. For instance, if your input data consists of strings that you wish to generate a word count feature from, you can just fill in the null values with empty strings (`""`).

Thankfully, `pandas` has a helpful method for dealing with null values: `.fillna()`. Remember to apply the same fill to both `df` and `df_eval`:

In [8]:
# Your code here:
df['blurb'] = df['blurb'].fillna('')
df_eval['blurb'] = df_eval['blurb'].fillna('')
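
Before filling, it can help to see where the nulls actually are. The following is a minimal sketch on a toy DataFrame (the blurb texts are invented for illustration) that counts missing values and then fills them with empty strings:

```python
import pandas as pd

# Toy data with one missing blurb
toy = pd.DataFrame({"blurb": ["A short film", None, "Board game for two"]})

# Count missing values per column before deciding how to fill them
print(toy.isna().sum())

# Replace missing text with an empty string so string methods behave uniformly
toy["blurb"] = toy["blurb"].fillna("")

print(toy["blurb"].tolist())
```

For numeric columns an empty string would not be appropriate; a median or a sentinel value is a common alternative, depending on the feature.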


### 1.3 Additional Processing

In the previous two exercises we have covered the most basic steps in preprocessing: dropping redundant columns and working with null values. However, there is *so* much more that we can do to extract useful information from our data. 

For instance, the `blurb` column contains unique strings and so, in its current form, isn't a particularly useful feature. Instead, we could create a new feature representing the length of the `blurb`, or the number of words in the `blurb`. 
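
As a sketch of those two text features, assuming a toy DataFrame with invented blurbs and nulls already filled with empty strings, character and word counts can be derived with the `.str` accessor:

```python
import pandas as pd

toy = pd.DataFrame({"blurb": ["A short film about hope", "", "Board game"]})

# Number of characters in each blurb
toy["blurb_len"] = toy["blurb"].str.len()

# Number of whitespace-separated words; an empty string splits into an empty list
toy["blurb_words"] = toy["blurb"].str.split().str.len()

print(toy[["blurb_len", "blurb_words"]])
```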

Other string-type columns, such as `country`, contain categorical data. As there are a lot of countries represented, we might want to aggregate these into regions (e.g. `Europe`, `Asia`, ...). We can then convert this categorical data into a one-hot encoding using `sklearn`.

What we are describing here is what's known as feature engineering and is an art and a science in its own right. 

Let's start this processing by importing some libraries and functions that can help us create features. Notice that we import `StandardScaler` from `sklearn`. We can use this class to standardise our numerical data (zero mean, unit variance), which is an important step when training many machine learning models.
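
As a minimal sketch of how `StandardScaler` is typically applied, assuming made-up goal amounts rather than the real data, the scaler learns its mean and standard deviation from the training values only and then reuses them on the held-out values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up goal amounts, one feature column each
train_goals = np.array([[1000.0], [2000.0], [3000.0]])
held_out_goals = np.array([[2000.0], [4000.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data
train_scaled = scaler.fit_transform(train_goals)

# transform reuses the training statistics; it does not refit on held-out data
held_out_scaled = scaler.transform(held_out_goals)

print(train_scaled.ravel())
print(held_out_scaled.ravel())
```

Fitting a second scaler on the held-out data would put the two sets on different scales, which is why only `transform` is called there.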

**💡 In the following cell, you can use feature engineering to create features that you think might be useful.**

<br>

You may want to put all your processing within a function (such as `processing()`) or may want to do it just as plain Python code. It's entirely up to you!

However, once you have processed `df` and `df_eval`, you must assign them to input variables `X` and `X_eval`.

In [9]:
country_map = {'AO': 'Africa', 'BF': 'Africa', 'BI': 'Africa', 'BJ': 'Africa', 'BW': 'Africa', 'CF': 'Africa', 'CG': 'Africa', 'CI': 'Africa', 'CM': 'Africa', 'CV': 'Africa', 'DJ': 'Africa', 'DZ': 'Africa',
               'EG': 'Africa', 'EH': 'Africa', 'ER': 'Africa', 'ET': 'Africa', 'GA': 'Africa', 'GH': 'Africa', 'GM': 'Africa', 'GN': 'Africa', 'GQ': 'Africa', 'GW': 'Africa', 'KE': 'Africa', 'KM': 'Africa',
               'LR': 'Africa', 'LS': 'Africa', 'LY': 'Africa', 'MA': 'Africa', 'MG': 'Africa', 'ML': 'Africa', 'MR': 'Africa', 'MU': 'Africa', 'MW': 'Africa', 'MZ': 'Africa', 'NA': 'Africa', 'NE': 'Africa',
               'NG': 'Africa', 'RE': 'Africa', 'RW': 'Africa', 'SC': 'Africa', 'SD': 'Africa', 'SL': 'Africa', 'SN': 'Africa', 'SO': 'Africa', 'ST': 'Africa', 'SZ': 'Africa', 'TD': 'Africa', 'TG': 'Africa',
               'TN': 'Africa', 'TZ': 'Africa', 'UG': 'Africa', 'YT': 'Africa', 'ZA': 'Africa', 'ZM': 'Africa', 'ZR': 'Africa', 'ZW': 'Africa', 'AG': 'Americas', 'AI': 'Americas', 'AN': 'Americas', 'AR': 'Americas',
                'AW': 'Americas', 'BB': 'Americas', 'BM': 'Americas', 'BO': 'Americas', 'BR': 'Americas', 'BS': 'Americas', 'BZ': 'Americas', 'CA': 'America', 'CL': 'Americas', 'CO': 'Americas', 'CR': 'Americas', 
                'CU': 'Americas', 'DM': 'Americas', 'DO': 'Americas', 'EC': 'Americas', 'FK': 'Americas', 'GD': 'Americas', 'GF': 'Americas', 'GL': 'America', 'GP': 'Americas', 'GT': 'Americas', 'GY': 'Americas', 
                'HN': 'Americas', 'HT': 'Americas', 'JM': 'Americas', 'KN': 'Americas', 'KY': 'Americas', 'LC': 'Americas', 'MQ': 'Americas', 'MS': 'Americas', 'MX': 'Americas', 'NI': 'Americas', 'PA': 'Americas', 
                'PE': 'Americas', 'PM': 'America', 'PR': 'Americas', 'PY': 'Americas', 'SR': 'Americas', 'SV': 'Americas', 'TC': 'Americas', 'TT': 'Americas', 'UE': 'Americas', 'US': 'America', 'UY': 'Americas', 
                'VC': 'Americas', 'VG': 'Americas', 'VI': 'Americas', 'AQ': 'Antarctica', 'AE': 'Asia', 'AF': 'Asia', 'AM': 'Asia', 'AZ': 'Asia', 'BD': 'Asia', 'BH': 'Asia', 'BN': 'Asia', 'BT': 'Asia', 'CC': 'Asia', 
                'CN': 'Asia', 'CX': 'Asia', 'CY': 'Asia', 'GE': 'Asia', 'HK': 'Asia', 'ID': 'Asia', 'IL': 'Asia', 'IN': 'Asia', 'IO': 'Asia', 'IQ': 'Asia', 'IR': 'Asia', 'JO': 'Asia', 'JP': 'Asia', 'KG': 'Asia', 
                'KH': 'Asia', 'KP': 'Asia', 'KR': 'Asia', 'KW': 'Asia', 'KZ': 'Asia', 'LA': 'Asia', 'LB': 'Asia', 'LK': 'Asia', 'MM': 'Asia', 'MN': 'Asia', 'MO': 'Asia', 'MV': 'Asia', 'MY': 'Asia', 'NP': 'Asia', 
                'OM': 'Asia', 'PH': 'Asia', 'PK': 'Asia', 'QA': 'Asia', 'RU': 'Asia', 'SA': 'Asia', 'SG': 'Asia', 'SY': 'Asia', 'TH': 'Asia', 'TJ': 'Asia', 'TM': 'Asia', 'TP': 'Asia', 'TR': 'Asia', 'TW': 'Asia', 
                'UZ': 'Asia', 'VN': 'Asia', 'YE': 'Asia', 'BV': 'Atlantic Ocean', 'GS': 'Atlantic Ocean', 'SH': 'Atlantic Ocean', 'AD': 'Europe', 'AL': 'Europe', 'AT': 'Europe', 'BA': 'Europe', 'BE': 'Europe', 
                'BG': 'Europe', 'BY': 'Europe', 'CH': 'Europe', 'CZ': 'Europe', 'DE': 'Europe', 'DK': 'Europe', 'EE': 'Europe', 'ES': 'Europe', 'FI': 'Europe', 'FO': 'Europe', 'FR': 'Europe', 'FX': 'Europe', 
                'GI': 'Europe', 'GR': 'Europe', 'HR': 'Europe', 'HU': 'Europe', 'IE': 'Europe', 'IS': 'Europe', 'IT': 'Europe', 'LI': 'Europe', 'LT': 'Europe', 'LU': 'Europe', 'LV': 'Europe', 'MC': 'Europe', 
                'MD': 'Europe', 'MK': 'Europe', 'MT': 'Europe', 'NL': 'Europe', 'NO': 'Europe', 'PL': 'Europe', 'PT': 'Europe', 'RO': 'Europe', 'SE': 'Europe', 'SI': 'Europe', 'SJ': 'Europe', 'SK': 'Europe', 
                'SM': 'Europe', 'UA': 'Europe', 'GB': 'Europe', 'VA': 'Europe', 'YU': 'Europe', 'HM': 'Indian Ocean', 'AS': 'Oceania', 'AU': 'Oceania', 'CK': 'Oceania', 'FJ': 'Oceania', 'FM': 'Oceania', 
                'GU': 'Oceania', 'KI': 'Oceania', 'MH': 'Oceania', 'MP': 'Oceania', 'NC': 'Oceania', 'NF': 'Oceania', 'NR': 'Oceania', 'NU': 'Oceania', 'NZ': 'Oceania', 'PF': 'Oceania', 'PG': 'Oceania', 
                'PN': 'Oceania', 'PW': 'Oceania', 'SB': 'Oceania', 'TK': 'Oceania', 'TO': 'Oceania', 'TV': 'Oceania', 'UM': 'Oceania', 'VU': 'Oceania', 'WF': 'Oceania', 'WS': 'Oceania'}

In [38]:
import json
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Your code here:
#
def processing(df):
    # campaign duration in seconds: deadline minus launch (both are Unix timestamps)
    df['time_open1'] = df['deadline'] - df['launched_at']
    # df['deadline'] = pd.to_datetime(df['deadline'], unit='s')
    # df['created_at'] = pd.to_datetime(df['created_at'], unit='s')
    # df['launched_at'] = pd.to_datetime(df['launched_at'], unit='s')
    # df['time_open'] = df['deadline'] - df['launched_at']

    # convert the goal to USD so that it is comparable across currencies
    df['goal_usd'] = df['goal'] * df['static_usd_rate']

    # the category JSON looks informative: keep only the first part of its slug (slug_1)
    df_cat = df.copy()
    df_cat = df_cat[['category']]
    df_cat['parsed_data'] = df_cat['category'].apply(json.loads)
    df_cat2 = pd.json_normalize(df_cat['parsed_data'])

    slug_split = df_cat2['slug'].str.split(pat = '/', expand = True)

    df_cat2[['slug_1', 'slug_2']] = slug_split
    df_cat2 = df_cat2[['slug_1']]
    df = pd.concat([df, df_cat2],axis=1)
    df = df.drop(columns=['category'], axis = 1)

    # map country into area (America on its own)
    df['area'] = df['country'].map(country_map)

    # length of the blurb in characters
    df['blurb_len'] = df['blurb'].str.len()

    # scaler = StandardScaler()

    df_scaled = df.copy()
    df_scaled = df_scaled[['goal_usd','blurb_len']]
    # ,'time_open1'
    # df_scaled = pd.DataFrame(scaler.fit_transform(df_scaled),columns = df_scaled.columns)

    # df = df.drop(columns=['goal_usd', 'time_open1','blurb_len'], axis=1)
    # df = pd.concat([df, df_scaled], axis = 1)


    df_cat = df[['area','slug_1']]
    # ,'currency'
    df2 = pd.get_dummies(df_cat)

    # onehotencoder
    # ohe = OneHotEncoder(handle_unknown='ignore', sparse_output = False).set_output(transform='pandas')

    # ohe_slug_1 = pd.get_dummies(.fit_transform(df[['slug_1']])

    # df = pd.concat([df, ohe_slug_1], axis = 1).drop(columns = ['slug_1'], axis = 1)

    # ohe_currency = ohe.fit_transform(df[['currency']])
    # df = pd.concat([df, ohe_currency], axis = 1).drop(columns = ['currency'], axis = 1)

    # ohe_area = ohe.fit_transform(df[['area']])
    # df = pd.concat([df, ohe_area], axis = 1).drop(columns = ['area'], axis = 1)

    df2 = pd.concat([df2, df_scaled], axis =1)
    # drop columns
    #df = df.drop(columns = ['blurb','goal','country','deadline','created_at','launched_at','static_usd_rate'], axis = 1)
    return df2
#
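X = processing(df)
# get_dummies only creates columns for categories present in each frame,
# so align X_eval to the columns of X (missing dummy columns become False)
X_eval = processing(df_eval).reindex(columns=X.columns, fill_value=False)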


In [39]:
X.head()

Unnamed: 0,area_America,area_Americas,area_Asia,area_Europe,area_Oceania,slug_1_art,slug_1_comics,slug_1_crafts,slug_1_dance,slug_1_design,...,slug_1_film & video,slug_1_food,slug_1_games,slug_1_journalism,slug_1_music,slug_1_photography,slug_1_publishing,slug_1_technology,goal_usd,blurb_len
0,True,False,False,False,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,125000.0,134
1,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,9800.0,131
2,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,2500.0,52
3,False,False,False,True,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,16800.793,120
4,True,False,False,False,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,5500.0,125


## Part 2: Training the model

Now that we have separated our data into train and evaluation data, we can start training models and evaluating their performance. At this point, you are welcome to explore any model architecture, so long as it is a **classification** model.

Check out the `sklearn` [documentation](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) for a selection of possible models and implementation examples.

Note that most `sklearn` models have the same interface. Once imported, you can instantiate your model (specifying custom settings as you see fit) and assign it to a variable. For instance:

``` Python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
```

Once you have created your `model` variable, you can call `.fit()` and pass `X` and `y` as arguments.

**💡 For KATE to work, your model must be assigned to a variable called `model`**

**NOTE**: Because your model will be trained directly on KATE for this project, you are limited to models that can be trained in under 1 minute. You will receive a `TimeoutError` if your model takes too long.
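
To get a rough feel for training time before submitting, you could time `.fit()` locally. The sketch below uses synthetic data from `make_classification` and a hypothetical `candidate` model. KATE's hardware may differ from yours, so treat the number only as an indication:

```python
import time

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the real feature matrix will differ in size and shape
X_demo, y_demo = make_classification(n_samples=5_000, n_features=20, random_state=0)

candidate = DecisionTreeClassifier(max_depth=10, random_state=0)

# perf_counter is a monotonic clock suited to measuring elapsed time
start = time.perf_counter()
candidate.fit(X_demo, y_demo)
elapsed = time.perf_counter() - start

print(f"Training took {elapsed:.2f} seconds")
```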


In [40]:
# Your code here:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(solver='lbfgs', max_iter = 1000)
lr.fit(X, y)

print(f"Accuracy score for Logistic Regression model: {lr.score(X, y)}")


from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X, y)
print(f"Accuracy score for GaussianNB  model: {gnb.score(X, y)}")

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=40, min_samples_split=80)
model.fit(X, y)
print(f"Accuracy score for DecisionTreeClassifier  model: {model.score(X, y)}")


# from sklearn.ensemble import RandomForestClassifier 
# rfc = RandomForestClassifier()
# rfc.fit(X, y)
# print(f"Accuracy score for RandomForestClassifier model: {rfc.score(X, y)}")





Accuracy score for Logistic Regression model: 0.5694
Accuracy score for GaussianNB  model: 0.51534
Accuracy score for DecisionTreeClassifier  model: 0.69864


Once trained, we can use the `.score()` function to evaluate our model's performance on the train set. Remember to pass `X` and `y` as arguments.

In [41]:
# Your code here:
print(f"Accuracy score for model: {model.score(X, y)}")

Accuracy score for model: 0.69864
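
Accuracy measured on the training set tends to be optimistic, because the model has already seen those rows. One common way to get a more honest local estimate is to hold out part of the labelled data. The sketch below uses synthetic data and an unrestricted decision tree (both are assumptions for illustration, not the assignment's data or required model) to show the gap between the two scores:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the labelled data
X_demo, y_demo = make_classification(n_samples=2_000, n_features=10, random_state=0)

# Keep 20% of the labelled rows aside; stratify preserves the class balance
X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# A tree with no depth limit can memorise the training rows
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)
val_acc = clf.score(X_val, y_val)

print(f"Train accuracy:      {train_acc:.3f}")
print(f"Validation accuracy: {val_acc:.3f}")
```

The validation score is usually the better guide to how a model might fare on unseen data such as `X_eval`.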


In [42]:
from sklearn.metrics import accuracy_score, f1_score

predictions = model.predict(X)
# Your code here...
acc = accuracy_score(y, predictions)
f1 = f1_score(y, predictions)
              
print(f'Accuracy: {acc}')
print(f'F1: {f1}')


Accuracy: 0.69864
F1: 0.7010554717879534
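
The F1 score is the harmonic mean of precision and recall. As a small worked example on invented labels (not model output from this notebook), the manual calculation can be compared with `f1_score`:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Invented labels: 2 true positives, 1 false negative, 1 false positive, 4 true negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 1, 0, 0, 0, 0]

precision = precision_score(y_true, y_hat)  # TP / (TP + FP)
recall = recall_score(y_true, y_hat)        # TP / (TP + FN)

# Harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy_score(y_true, y_hat)}")
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}")
print(f"F1 manual: {f1_manual:.3f}, F1 sklearn: {f1_score(y_true, y_hat):.3f}")
```

Unlike accuracy, F1 ignores true negatives, so the two metrics can diverge when the classes are imbalanced.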


## Part 3: Making predictions

Now that our model is trained, we can use the `.predict()` function to make predictions for the rows in our data where `y` is not known.
 - Call `.predict()` on the `model` variable, and pass `X_eval`
 - Assign the output of `.predict()` to a variable called `y_pred`

In [43]:
# Your code here:
y_pred = model.predict(X_eval)

Note that in the previous exercise, we used the `.score()` function to evaluate our model on the training data. However, we do not have the ground truth for our `X_eval` data points - to see how well the model performs on the test set, you will have to submit it to **KATE**!
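
Although the labels for `X_eval` are not available locally, cross-validation on the labelled data can give a rough idea of what to expect before submitting. This is a sketch on synthetic data with an arbitrarily chosen tree depth, not a prediction of the score KATE will report:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the labelled training data
X_demo, y_demo = make_classification(n_samples=1_000, n_features=10, random_state=0)

clf = DecisionTreeClassifier(max_depth=5, random_state=0)

# 5-fold cross-validation: each fold is held out once while the rest is used for training
scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")

print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```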