# Assignment 6

In [1]:
# Import libraries needed for this lab
from hashlib import sha1

import altair as alt
import graphviz
import numpy as np
import pandas as pd

from sklearn import tree
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import make_column_transformer 
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import (
    FunctionTransformer,
    Normalizer,
    OneHotEncoder,
    StandardScaler,
    normalize,
    scale)
from sklearn.svm import SVC

import test_assignment6 as t
#alt.renderers.enable('mimetype')
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

## 1. Introducing and Exploring the dataset <a name="1"></a>
<hr>


In this assignment you will be working with [the Olympics Games DataSet](https://www.kaggle.com/samruddhim/olympics-althlete-events-analysis).

Our problem is to predict the medal type of each example.
You can find more information on the dataset and features [here](https://www.kaggle.com/samruddhim/olympics-althlete-events-analysis).


*Note that many popular datasets have sex as a feature where the possible values are male and female. This representation reflects how the data were collected and is not meant to imply that, for example, gender is binary.*


The following starter code preprocesses the data to get rid of rows with `NaN` values in the target column `Medal`.

In [3]:
medal_df = pd.read_csv("data/athlete_events.csv")
medal_df = medal_df.dropna(subset=['Medal'])

**Question 1.1** <br> {points: 1}  

In order to avoid violating the golden rule, before we do anything with the data, let's split it.

Split the data into `train_df` (80%) and `test_df` (20%). 

Keep the target column (`Medal`) in the splits so that we can use it in EDA. 

Make sure to set `random_state=123`. 


In [5]:
train_df, test_df =  train_test_split(medal_df, test_size=0.20, random_state=123)

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

In [7]:
t.test_1_1(train_df,test_df)

'Success'

**Question 1.2** <br> {points: 1}  

How many examples are there in our training data? 

Save your answer in an object named `training_size`.

In [9]:
training_size = 31826

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer
training_size

31826

In [11]:
t.test_1_2(training_size)

'Success'
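
Rather than hard-coding the count, the training size can be read off the split itself. The following is a minimal sketch on a made-up 10-row frame; `toy_df` and its values are illustrative assumptions, not the assignment data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row frame standing in for medal_df
toy_df = pd.DataFrame({"Age": range(20, 30), "Medal": ["Gold", "Silver"] * 5})

# Same split settings as Question 1.1
toy_train, toy_test = train_test_split(toy_df, test_size=0.20, random_state=123)

# The number of training examples is the number of rows in the training split
toy_training_size = len(toy_train)
print(toy_training_size, len(toy_test))
```

On the assignment data, the analogous expression would presumably be `len(train_df)`.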

**Question 1.3** <br> {points: 3}  

Let's examine our `train_df` a bit. 

What are the youngest and oldest ages of athletes who won a medal in the Olympics?

Save the results in objects `youngest_age` and `oldest_age`. 


In [13]:
youngest_age = 10
oldest_age = 73

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

In [15]:
# check that the variable exists
assert 'oldest_age' in globals(
), "Please make sure that your solution is named 'oldest_age'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

In [17]:
t.test_1_3_2(youngest_age)

'Success'
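
The two ages can also be computed instead of typed in by hand. A small sketch, assuming a toy `Age` column with one missing value (`toy_ages` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including a missing value
toy_ages = pd.DataFrame({"Age": [10.0, 24.0, np.nan, 73.0, 31.0]})

# min() and max() skip NaN by default
toy_youngest = toy_ages["Age"].min()
toy_oldest = toy_ages["Age"].max()
print(toy_youngest, toy_oldest)
```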

**Question 1.4** <br> {points: 1}  

Look at the column dtypes using `.info()`.

How many non-numeric **features** are there? 

Save the results in an object named `num_cat_feats`.

In [19]:
num_cat_feats = 9

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

In [21]:
t.test_1_4(num_cat_feats)

'Success'
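
As a cross-check on reading `.info()` by eye, the non-numeric columns can be selected by dtype. This is a sketch on a hypothetical four-column frame (`toy_info`), not the assignment data.

```python
import pandas as pd

# Hypothetical frame with two text columns and two numeric columns
toy_info = pd.DataFrame({
    "Sex": ["M", "F"],
    "Age": [21.0, 24.0],
    "NOC": ["IND", "GDR"],
    "Year": [1964, 1976],
})

# Keep every column whose dtype is not numeric
toy_non_numeric = toy_info.select_dtypes(exclude="number").columns.tolist()
print(toy_non_numeric, len(toy_non_numeric))
```

When applying the same idea to `train_df`, remember that the target `Medal` is not a feature and should be left out of the count.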

**Question 1.5** <br> {points: 3}  

Let's take a look at some of the columns and the categories within them. 

Use `.describe` to answer the following questions. Save the describe dataframe in an object named `describe_df`.  

a) Which categorical feature has the most unique values? Save this in an object named `most_unique`. 

b) How many binary columns are there? Save this in an object named `binary_cols`. 

c) How many categorical features have missing values? Save this number in an object named `missing_cat`.



In [23]:

most_unique = 'Name'
binary_cols = 2
missing_cat = 0

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer


In [25]:
t.test_1_5_1(most_unique)

'Success'

In [27]:
t.test_1_5_2(binary_cols)

'Success'

In [29]:
t.test_1_5_3(missing_cat)

'Success'
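
The three answers above can be derived from the describe output rather than read off manually. A minimal sketch, assuming a toy frame (`toy_cat` is hypothetical) with one missing categorical value:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: three text columns (one with a missing value) and one numeric
toy_cat = pd.DataFrame({
    "Name": ["Ann", "Bo", "Cy", "Di"],
    "Sex": ["M", "F", "M", "F"],
    "Season": ["Summer", "Winter", "Summer", None],
    "Age": [21.0, 28.0, 24.0, 30.0],
})

# Summarise only the non-numeric columns; rows are count, unique, top, freq
toy_describe = toy_cat.describe(exclude=[np.number])

toy_unique = toy_describe.loc["unique"].astype(int)
toy_most_unique = toy_unique.idxmax()            # column with the most categories
toy_binary_cols = int((toy_unique == 2).sum())   # columns with exactly two categories
# A column has missing values when its non-null count is below the row count
toy_missing_cat = int((toy_describe.loc["count"].astype(int) < len(toy_cat)).sum())
print(toy_most_unique, toy_binary_cols, toy_missing_cat)
```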

**Question 1.6** <br> {points: 2}  

Filter or groupby the `train_df` dataframe to answer the next question. 

Which `NOC` won the most medals? Save this in an object named `most_medals`. 

Which `NOC` won the most `Gold` medals? Save this in an object named `most_gold`. 


In [31]:
most_medals = train_df['NOC'].value_counts().idxmax()
most_gold = train_df[train_df['Medal'] == 'Gold']['NOC'].value_counts().idxmax()
(most_medals, most_gold)
# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer


('USA', 'USA')

In [33]:
t.test_1_6_1(most_medals)

'Success'

In [35]:
t.test_1_6_2(most_gold)

'Success'
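
An alternative to chaining `value_counts()` is a single cross-tabulation that answers both questions at once. The sketch below uses made-up rows (`toy_medals` is hypothetical).

```python
import pandas as pd

# Hypothetical medal records
toy_medals = pd.DataFrame({
    "NOC": ["USA", "USA", "URS", "USA", "URS", "GER"],
    "Medal": ["Gold", "Silver", "Gold", "Bronze", "Gold", "Bronze"],
})

# Rows are NOCs, columns are medal types, cells are counts
toy_table = pd.crosstab(toy_medals["NOC"], toy_medals["Medal"])

toy_most_medals = toy_table.sum(axis=1).idxmax()  # most medals overall
toy_most_gold = toy_table["Gold"].idxmax()        # most Gold medals
print(toy_most_medals, toy_most_gold)
```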

We are going to separate feature vectors from the targets.

We are only going to use the following columns:

- `Sex`
- `Age`
- `Height`
- `Weight`
- `NOC`
- `Year`
- `Season`
- `City`
- `Sport`


We will use `Medal` as the target column. 

We've created  `X_train`, `y_train`, `X_test`, `y_test` for you. 

In [37]:
X_train = train_df.drop(columns=['ID', 'Name', 'Team', 'Event','Medal', 'Games'])
y_train = train_df['Medal']

X_test = test_df.drop(columns=['ID', 'Name', 'Team', 'Event','Medal', 'Games'])
y_test = test_df['Medal']

X_train.head()

Unnamed: 0,Sex,Age,Height,Weight,NOC,Year,Season,City,Sport
221047,M,21.0,174.0,60.0,IND,1964,Summer,Tokyo,Hockey
222232,M,28.0,170.0,70.0,SWE,1980,Summer,Moskva,Wrestling
122592,F,24.0,162.0,56.0,GDR,1976,Summer,Montreal,Diving
14077,F,24.0,172.0,67.0,CHN,2008,Summer,Beijing,Fencing
222715,M,30.0,,,TCH,1948,Winter,Sankt Moritz,Ice Hockey


## 2. Preprocessing and building your pipelines

**Question 2.1** <br> {points: 4}  

Before you can start preprocessing the data, you need to identify the binary, categorical, ordinal and numeric columns in your `X_train` and build lists of each feature type. 


Save the column names in lists named  `numeric_feats`, `binary_feats`, `categorical_feats` and `ordinal_feat`.


In [39]:
X_train.head()

Unnamed: 0,Sex,Age,Height,Weight,NOC,Year,Season,City,Sport
221047,M,21.0,174.0,60.0,IND,1964,Summer,Tokyo,Hockey
222232,M,28.0,170.0,70.0,SWE,1980,Summer,Moskva,Wrestling
122592,F,24.0,162.0,56.0,GDR,1976,Summer,Montreal,Diving
14077,F,24.0,172.0,67.0,CHN,2008,Summer,Beijing,Fencing
222715,M,30.0,,,TCH,1948,Winter,Sankt Moritz,Ice Hockey


In [41]:
numeric_feats =  ["Age", "Height","Weight","Year"] 
binary_feats = ["Sex", "Season"] 
categorical_feats = ["NOC", "City", "Sport"] 
ordinal_feat =  [] 

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

In [43]:
t.test_2_1_1(numeric_feats)

'Success'

In [45]:
t.test_2_1_2(binary_feats)

'Success'

In [47]:
t.test_2_1_3(categorical_feats)

'Success'

In [49]:
t.test_2_1_4(ordinal_feat)

'Success'
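
A quick sanity check for hand-written feature lists is to confirm that every column is assigned exactly once. The helper below is a hypothetical sketch (`check_feature_lists` is not part of the assignment or of any library); the column names are the nine features listed earlier.

```python
def check_feature_lists(columns, *feature_lists):
    """Return (missing, duplicated) column names for a set of feature lists."""
    assigned = [name for feats in feature_lists for name in feats]
    missing = sorted(set(columns) - set(assigned))
    duplicated = sorted({name for name in assigned if assigned.count(name) > 1})
    return missing, duplicated


toy_columns = ["Sex", "Age", "Height", "Weight", "NOC", "Year", "Season", "City", "Sport"]
toy_missing, toy_duplicated = check_feature_lists(
    toy_columns,
    ["Age", "Height", "Weight", "Year"],  # numeric
    ["Sex", "Season"],                    # binary
    ["NOC", "City", "Sport"],             # categorical
    [],                                   # ordinal
)
print(toy_missing, toy_duplicated)
```

Two empty lists mean no column was forgotten and none was placed in two groups.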

**Question 2.2** <br> {points: 1}  

Ok let's start making our pipelines. Use `make_pipeline()` to make a pipeline for the numeric features called `numeric_transformer`. 

Use `SimpleImputer()` with `strategy="median"`. For the second step, make sure to use standardization with `StandardScaler()`.

In [51]:
numeric_transformer = Pipeline(
 steps=[("imputer", SimpleImputer(strategy="median")),
 ("scaler", StandardScaler())]
)

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

In [53]:
t.test_2_2(numeric_transformer)

'Success'
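
To see what the two steps do, here is a sketch of the same kind of pipeline built with `make_pipeline()` and applied to a made-up single column (`toy_numeric` is hypothetical). The missing value is first replaced by the median and then every value is standardized.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric column with one missing value
toy_numeric = np.array([[1.0], [2.0], [np.nan], [3.0]])

toy_numeric_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
toy_scaled = toy_numeric_pipe.fit_transform(toy_numeric)

# The median of 1, 2, 3 is 2, which is also the mean after imputation,
# so the imputed row is mapped to 0 by the scaler
print(toy_scaled.ravel())
```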

**Question 2.3** <br> {points: 1}  

Next, use `make_pipeline()` to make a pipeline for the categorical features called `categorical_transformer`. 

Use `SimpleImputer()` with `strategy="most_frequent"`. 

Make sure to use the necessary one-hot encoding transformer with `dtype=int` and `handle_unknown="ignore"`.

In [55]:
categorical_transformer = make_pipeline(
 SimpleImputer(strategy="most_frequent"),
 OneHotEncoder(handle_unknown="ignore", dtype=int) # Set dtype=int
)

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

In [57]:
t.test_2_3(categorical_transformer)

'Success'
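
The effect of `handle_unknown="ignore"` is easiest to see on a tiny example. In this sketch (toy country codes, not the assignment data), a category that was never seen during fitting is encoded as a row of zeros instead of raising an error.

```python
from sklearn.preprocessing import OneHotEncoder

toy_encoder = OneHotEncoder(handle_unknown="ignore", dtype=int)
toy_encoder.fit([["USA"], ["CAN"], ["USA"]])  # learned categories: CAN, USA

toy_known = toy_encoder.transform([["USA"]]).toarray()
toy_unknown = toy_encoder.transform([["FRA"]]).toarray()  # unseen at fit time

print(toy_known, toy_unknown)
```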

**Question 2.4** <br> {points: 1}  
  
Use `make_pipeline()` to make a pipeline for the binary features called `binary_transformer`. 

Use `SimpleImputer()` with `strategy="most_frequent"`. 

Make sure to use the necessary one-hot encoding transformer with `dtype=int`.

In [59]:
binary_transformer = make_pipeline(
 SimpleImputer(strategy="most_frequent"), # Impute missing values with the most frequent value
 OneHotEncoder(dtype=int, drop="if_binary") # One-hot encode, dropping one category for binary features
)

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

In [61]:
t.test_2_4(binary_transformer)

'Success'
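
With `drop="if_binary"`, a two-category feature is encoded as a single 0/1 column rather than two redundant columns. A minimal sketch on made-up values:

```python
from sklearn.preprocessing import OneHotEncoder

toy_binary_encoder = OneHotEncoder(dtype=int, drop="if_binary")

# Categories are sorted as F, M; the first one (F) is dropped,
# so the single remaining column indicates M
toy_binary = toy_binary_encoder.fit_transform([["M"], ["F"], ["M"]]).toarray()
print(toy_binary.shape, toy_binary.ravel())
```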

**Question 2.5** <br> {points: 1}  


Define a column transformer using [`make_column_transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html) called `preprocessor` for the numerical, categorical, and remaining feature types.


In [63]:
preprocessor = make_column_transformer(
 (numeric_transformer, numeric_feats),
 (categorical_transformer, categorical_feats),
 (binary_transformer, binary_feats),
 remainder='passthrough')

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

In [65]:
t.test_2_5(preprocessor)

'Success'
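
To get a feel for what a column transformer of this shape produces, the sketch below applies the same three kinds of pipelines to a hypothetical three-column frame (`toy_X` is made up) and checks the number of output columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: one numeric, one categorical and one binary column
toy_X = pd.DataFrame({
    "Age": [21.0, np.nan, 24.0, 30.0],
    "NOC": ["IND", "SWE", "GDR", "IND"],
    "Sex": ["M", "M", "F", "F"],
})

toy_preprocessor = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), ["Age"]),
    (make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OneHotEncoder(handle_unknown="ignore", dtype=int)), ["NOC"]),
    (make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OneHotEncoder(dtype=int, drop="if_binary")), ["Sex"]),
)

toy_out = toy_preprocessor.fit_transform(toy_X)
# The result may be sparse or dense depending on its density
if hasattr(toy_out, "toarray"):
    toy_out = toy_out.toarray()

# 1 scaled numeric column + 3 NOC indicator columns + 1 binary column
print(toy_out.shape)
```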

## 3. Model Building

**Question 3.1** <br> {points: 1}  

It's important to build a dummy classifier to compare our model to. Make a `DummyClassifier` using `strategy="prior"`. 

Carry out 5-fold cross validation on `X_train` and `y_train` using `cross_validate()`. Don't forget to include the training score. 

Save the results in a dataframe named `dummy_scores`. 

In [81]:
dummy_clf = DummyClassifier(strategy="prior") 
cv_results = cross_validate(dummy_clf, X_train, y_train, cv=5, return_train_score=True)
dummy_scores = pd.DataFrame(cv_results)
dummy_scores
# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.042001,0.013033,0.334433,0.334328
1,0.033998,0.007,0.334328,0.334355
2,0.030269,0.007998,0.334328,0.334355
3,0.026443,0.007565,0.334328,0.334355
4,0.026998,0.007006,0.334328,0.334355


In [83]:
t.test_3_1(dummy_scores)

'Success'
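
With `strategy="prior"`, the dummy classifier always predicts the most frequent class, so its accuracy equals that class's share of the data. This is why the scores above sit near 0.334: the largest medal class makes up roughly a third of the examples. A sketch on made-up labels:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 6 of one class, 4 of another; features are ignored by the dummy
toy_X_dummy = np.zeros((10, 1))
toy_y_dummy = np.array(["Gold"] * 6 + ["Silver"] * 4)

toy_dummy = DummyClassifier(strategy="prior").fit(toy_X_dummy, toy_y_dummy)

# Accuracy equals the majority-class share (6 out of 10)
toy_dummy_accuracy = toy_dummy.score(toy_X_dummy, toy_y_dummy)
print(toy_dummy_accuracy)
```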

**Question 3.2** <br> {points: 1}  


Define a main pipeline called `main_pipe` that transforms all the different features and uses a `RandomForestClassifier` model using `random_state=77` and setting the hyperparameter `n_estimators` to 10. 

In [71]:
main_pipe = make_pipeline(
 preprocessor,
 RandomForestClassifier(n_estimators=10, random_state=77)
)
# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

In [73]:
t.test_3_2(main_pipe)

'Success'

**Question 3.3** <br> {points: 1}  

Perform 5-fold cross-validation on `X_train` and `y_train` using the main pipeline `main_pipe`. Make sure to set `return_train_score=True` and save the result in a dataframe called `scores_df`. 

*Note: This could take 5 minutes.*

In [85]:
# Perform 5-fold cross-validation with return_train_score=True
cv_results = cross_validate(main_pipe, X_train, y_train, cv=5, return_train_score=True)

# Save the results in a DataFrame
scores_df = pd.DataFrame(cv_results)
scores_df
# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

Unnamed: 0,fit_time,score_time,test_score,train_score
0,5.468624,0.073523,0.607289,0.934721
1,5.755074,0.081499,0.616654,0.934645
2,6.005432,0.06252,0.608798,0.93437
3,6.038594,0.061832,0.624038,0.933585
4,5.571718,0.063997,0.622781,0.93331


In [87]:
t.test_3_3(scores_df)

'Success'

**Question 3.4** <br> {points: 2}

What are the mean training and cross-validation scores? 

Save the mean training score in `mean_training_score` and the mean cross-validation score in the object named `cv_score`.

In [91]:
mean_training_score = scores_df['train_score'].mean()
cv_score =  scores_df['test_score'].mean() 

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer
print(mean_training_score, cv_score)

0.9341261908105037 0.6159117898280807


In [93]:
# check that the variable exists
assert 'cv_score' in globals(
), "Please make sure that your solution is named 'cv_score'"

assert 'mean_training_score' in globals(
), "Please make sure that your solution is named 'mean_training_score'"
# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 3.5** <br> {points: 1}

Is the model overfitting or underfitting? 

A) Overfitting

B) Underfitting

C) Neither

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer3_05`.*

In [95]:
answer3_05 = 'A'

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer
answer3_05

'A'

In [97]:
t.test_3_5(answer3_05)

'Success'
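
The overfitting answer follows from the gap between the two mean scores reported in Question 3.4. The arithmetic, using those printed values, is sketched below (the variable names are illustrative).

```python
# Mean scores copied from the output of Question 3.4
reported_train_mean = 0.9341261908105037
reported_cv_mean = 0.6159117898280807

# A large positive gap means the model fits the training folds
# far better than the held-out folds
score_gap = reported_train_mean - reported_cv_mean
print(round(score_gap, 3))
```

A gap of about 0.32 between training and validation accuracy is a typical sign of overfitting.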

**Question 3.6** <br> {points: 1}

Which model performed better?

A) `RandomForestClassifier`

B) `DummyClassifier`

C) Neither

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer3_06`.*

In [101]:
answer3_06 = 'A'

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer
answer3_06

'A'

In [103]:
t.test_3_6(answer3_06)

'Success'

**Question 3.7** <br> {points: 1}  
Now that we have our pipelines and a model let's tune the hyperparameter `max_depth`. 

Sweep over the hyperparameters in `param_grid` using `RandomizedSearchCV` with `cv=5` and `n_iter=5`, setting `return_train_score=True`. Don't forget to set `random_state=77`.

Save your grid search in an object named `depth_search`. 

You may also want to set `verbose=2` since it may take some time. 

Don't forget to fit `depth_search`.


In [107]:
param_grid = {
    "randomforestclassifier__max_depth": list(range(1, 151, 10))
}

# Set up RandomizedSearchCV
depth_search = RandomizedSearchCV(
    main_pipe,  # Use your pipeline with the preprocessor and classifier
    param_grid,  # The hyperparameter grid to search
    cv=5,  # 5-fold cross-validation
    verbose=2,  # Show detailed output
    n_iter=5,  # Number of random iterations to try
    random_state=77,  # For reproducibility
    return_train_score=True  # Return train scores as well
)

# Fit the model
depth_search.fit(X_train, y_train)

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer


Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV] END ..............randomforestclassifier__max_depth=141; total time=   5.7s
[CV] END ..............randomforestclassifier__max_depth=141; total time=   5.2s
[CV] END ..............randomforestclassifier__max_depth=141; total time=   5.9s
[CV] END ..............randomforestclassifier__max_depth=141; total time=   5.4s
[CV] END ..............randomforestclassifier__max_depth=141; total time=   5.8s
[CV] END ...............randomforestclassifier__max_depth=61; total time=   4.5s
[CV] END ...............randomforestclassifier__max_depth=61; total time=   4.7s
[CV] END ...............randomforestclassifier__max_depth=61; total time=   4.9s
[CV] END ...............randomforestclassifier__max_depth=61; total time=   4.4s
[CV] END ...............randomforestclassifier__max_depth=61; total time=   4.6s
[CV] END ...............randomforestclassifier__max_depth=21; total time=   1.0s
[CV] END ...............randomforestclassifier__m

In [108]:
t.test_3_7(depth_search)

'Success'
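
The key `randomforestclassifier__max_depth` works because `make_pipeline()` names each step after its lowercased class name, and `step__parameter` addresses a parameter of that step. The sketch below shows the same pattern on a small synthetic dataset; the data, the grid, and the `toy_` names are assumptions chosen so that it runs in well under a second.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic classification problem (not the Olympics data)
toy_X_rf, toy_y_rf = make_classification(n_samples=60, n_features=5, random_state=0)

toy_pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=5, random_state=77),
)
print(list(toy_pipe.named_steps))  # step names used as prefixes in the grid

toy_grid = {"randomforestclassifier__max_depth": [1, 3, 5]}

# Sample 2 of the 3 candidate depths and evaluate each with 3-fold CV
toy_search = RandomizedSearchCV(toy_pipe, toy_grid, n_iter=2, cv=3, random_state=77)
toy_search.fit(toy_X_rf, toy_y_rf)

toy_best_depth = toy_search.best_params_["randomforestclassifier__max_depth"]
print(toy_best_depth)
```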

**Question 3.8** <br> {points: 1}  

Obtain the results for cross validation from grid search using `depth_search.cv_results_`.

Select the columns:

- `mean_test_score`
- `param_randomforestclassifier__max_depth`
- `mean_fit_time`
- `rank_test_score`

Sort your values in ascending order of `rank_test_score`. 

Make sure to save it as a dataframe and display it. Save this as an object named `grid_results`.

In [111]:
grid_results = pd.DataFrame(depth_search.cv_results_)[
    ["mean_test_score", "param_randomforestclassifier__max_depth", "mean_fit_time", "rank_test_score"]
].sort_values(by="rank_test_score", ascending=True)

grid_results

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

Unnamed: 0,mean_test_score,param_randomforestclassifier__max_depth,mean_fit_time,rank_test_score
0,0.615912,141,5.65263,1
3,0.614404,91,5.599773,2
1,0.608685,61,4.642264,3
4,0.571608,31,1.881619,4
2,0.516936,21,1.009707,5


In [113]:
t.test_3_8(grid_results)

'Success'

**Question 3.9** <br> {points: 1} 

What is the best hyperparameter value for `max_depth`? Save it in an object named `best_depth`. 

What was the corresponding validation score for it? Save this in an object named `best_depth_score`. 

*Hint: `.best_params_`  and `.best_score_` are helpful here.* 

In [115]:
best_depth = depth_search.best_params_["randomforestclassifier__max_depth"] 

best_depth_score = depth_search.best_score_

(best_depth, best_depth_score)

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

(141, 0.6159117898280807)

In [117]:
t.test_3_9(depth_search, best_depth, best_depth_score)

'Success'

## 4. Evaluating on the test set <a name="5"></a>
<hr>

Now that we have a best performing model, it's time to assess our model on the set aside test set. 

**Question 4.1** <br> {points: 2} 

What is the training score of the best scoring model? Save the result in an object named `train_score`. 

In [121]:
train_score = depth_search.cv_results_["mean_train_score"][depth_search.best_index_]
# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

In [123]:
assert 'train_score' in globals(
), "Please make sure that your solution is named 'train_score'"
# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 4.2** <br> {points: 1} 


What is the test score of the best model? 

Score the best model from `depth_search` on `X_test` and `y_test`. 

Save the result in an object named `test_score`. 


In [125]:
test_score = depth_search.best_estimator_.score(X_test, y_test)
# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

In [127]:
t.test_4_2(test_score)

'Success'

## 5. Text Data

Let's develop our own SMS spam filtering system using Kaggle's [SMS Spam Collection Dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset) that was originally referenced from [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). 

We will use `CountVectorizer` to encode text messages and `SVC` for classification. 

**Sorry for the offensive language in some text messages; it's the reality of such platforms. If you are sensitive to such language try not to read the raw messages.** 

In [131]:
sms_df = pd.read_csv("data/spam.csv", encoding="latin-1")
sms_df = sms_df.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
sms_df = sms_df.rename(columns={"v1": "target", "v2": "sms"})

In [133]:
sms_df.shape

(5572, 2)

**Question 5.1** <br> {points: 1}  

Split `sms_df` into train (80%) and test splits (20%) setting `random_state=123`. 
Name your objects `text_train_df` and `text_test_df`. 
Examine the first few rows of the train portion. 

In [135]:
text_train_df, text_test_df = train_test_split(sms_df,test_size=0.20,random_state=123)

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

In [137]:
t.test_5_1(text_train_df, text_test_df)

'Success'

**Question 5.2** <br> {points: 1}  

Split both `text_train_df` and `text_test_df` into the target and feature columns. Here,  `target` is the target column (`y`) and `sms` is the column in your `X`. 
    
Name your objects `X_text_train`, `y_text_train`, `X_text_test`, and `y_text_test`.

*Hint: Make sure that you are using single brackets (a Pandas Series) for your target (y) objects. The tests will not pass unless your y variables are of type Pandas Series. This can be done by selecting the column target with single square brackets.*

In [139]:
X_text_train = text_train_df["sms"]
y_text_train = text_train_df["target"]

X_text_test = text_test_df["sms"]
y_text_test = text_test_df["target"]

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

In [141]:
t.test_5_2(X_text_train, X_text_test, y_text_train, y_text_test)

'Success'

**Question 5.3** <br> {points: 2}  

Note that for text data, the usual EDA is not applicable. In this question, we will carry out some simple EDA to get a sense of the data.  

What's the label distribution in the target column (How many `ham` and how many `spam` values do you have in the column `target`) in the training set? 

Save the result in an object named `target_freq`.

*Hint: There is a function that will give us the frequency of each category in a column.*

In [143]:
target_freq = y_text_train.value_counts()

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

In [145]:
assert 'target_freq' in globals(
), "Please make sure that your solution is named 'target_freq'"
# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 5.4** <br> {points: 1} 

What's the average length in characters of text messages? Save the value to the nearest whole value in an object named `avg_text`. 

*Hint: `str.len()` may come in handy here.* 

In [147]:
X_text_train = text_train_df["sms"]
avg_text = (X_text_train.str.len().mean())

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

In [149]:
t.test_5_4(avg_text)

'Success'
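
The question asks for the nearest whole value, which the mean of `str.len()` does not give directly. One way to round it is sketched below on three made-up messages (`toy_sms` is hypothetical).

```python
import pandas as pd

# Hypothetical messages with lengths 2, 5 and 3
toy_sms = pd.Series(["hi", "hello", "hey"])

toy_mean_length = toy_sms.str.len().mean()         # average length in characters
toy_avg_text = int(round(float(toy_mean_length)))  # nearest whole number
print(toy_mean_length, toy_avg_text)
```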

**Question 5.5** <br> {points: 1} 

Would you classify `sms` column as a categorical column? Does it make sense to carry out one-hot encoding on this column?

A) It is a categorical column and I would carry out one-hot encoding on this column.

B) It is a categorical column and I would **NOT** carry out one-hot encoding on this column.

C) It is a free text column and I would carry out one-hot encoding on this column.

D) It is a free text column and I would **NOT** carry out one-hot encoding on this column.

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer5_05`.*

In [153]:
answer5_05 = 'D'

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer
answer5_05

'D'

In [155]:
t.test_5_5(answer5_05)

'Success'

**Question 5.6** <br> {points: 0}  
Import `CountVectorizer` from the appropriate library. 

In [161]:
from sklearn.feature_extraction.text import CountVectorizer
# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

In [163]:
t.test_5_6()

'Success'

**Question 5.7** <br> {points: 1} 

Transform the training data using `CountVectorizer` with default parameters. Create an object named `vec`, fit it on `X_text_train` and `y_text_train` and transform `X_text_train`. 

Save the newly transformed `X_text_train` in an object named `transformed_X_train`. 

In [169]:
vec = CountVectorizer()
transformed_X_train = vec.fit_transform(X_text_train)

In [171]:
t.test_5_7(transformed_X_train)

'Success'

**Question 5.8** <br> {points: 1} 

How many features have been created to represent each text message? 

Save the value in an object named `vocab_size`.

In [173]:
vocab_size = transformed_X_train.shape[1]

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

In [175]:
t.test_5_8(vocab_size)

'Success'

**Question 5.9** <br> {points: 2} 

What do each feature and each feature value represent? 

A) A word in the corpus with the value representing the number of times the word occurs in the given text message.

B) A text message in the corpus with the value representing the distance from the closest text in the corpus.

C) An example in the corpus with the value representing the length of the text message.

D) A sentence in the corpus with the value representing the number of times the sentence occurs in the given text message.

*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer5_09`.*

In [177]:
answer5_09 = 'A'

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer
answer5_09

'A'

In [179]:
assert 'answer5_09' in globals(
), "Please make sure that your solution is named 'answer5_09'"
# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 5.10** <br> {points: 1} 

Build a pipeline named `dummy_pipe` for feature extraction using `CountVectorizer` with `binary=True` and `DummyClassifier` with strategy equal to `most_frequent`.

Use `cross_validate()` setting `cv=5` with `dummy_pipe` and set `return_train_score=True` on `X_text_train` and `y_text_train` to obtain the train and test scores. 

Save this in a dataframe named `dummy_scores`. 

In [181]:
# Create the pipeline with CountVectorizer and DummyClassifier
dummy_pipe = Pipeline([
    ('vectorizer', CountVectorizer(binary=True)),  # Convert text to binary bag-of-words
    ('classifier', DummyClassifier(strategy='most_frequent'))  # Predicts the most frequent class
])

# Perform cross-validation
dummy_scores = cross_validate(
    dummy_pipe, X_text_train, y_text_train, cv=5, return_train_score=True
)

# Convert the results to a DataFrame (pandas is already imported as pd above)
dummy_scores = pd.DataFrame(dummy_scores)

# Display the first few rows
dummy_scores.head()

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.106825,0.029726,0.862108,0.862272
1,0.109034,0.021029,0.862108,0.862272
2,0.096855,0.024894,0.863075,0.86203
3,0.112272,0.029735,0.861953,0.862311
4,0.11617,0.024024,0.861953,0.862311


In [183]:
t.test_5_10(dummy_pipe, dummy_scores)

'Success'

**Question 5.11** <br> {points: 1} 

What are the mean values of the columns in `dummy_scores`? Save this in an object named `dummy_scores_mean`.

In [185]:
dummy_scores_mean = dummy_scores.mean()

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

In [187]:
t.test_5_11(dummy_scores_mean)

'Success'

**Question 5.12** <br> {points: 1} 

Very often representing your free text feature values in a binary format works better in practice than the default one and so we are going with that. 

Now build a pipeline named `svc_pipe_binary` for feature extraction using `CountVectorizer` with `binary=True` and `SVC` with default hyperparameters. Make sure you are using `make_pipeline()` for this. 

Cross validate on `svc_pipe_binary` using `X_text_train` and `y_text_train` and setting `cv=5`  and `return_train_score=True`.  

Save the results in a dataframe named `svc_scores`. 

In [189]:
# Build the pipeline using make_pipeline
svc_pipe_binary = make_pipeline(
 CountVectorizer(binary=True), # Binary Bag-of-Words feature extraction
 SVC() # Support Vector Classifier with default hyperparameters
)
# Perform cross-validation
svc_scores = cross_validate(
 svc_pipe_binary, X_text_train, y_text_train, cv=5, return_train_score=True
)
# Convert the results to a DataFrame
svc_scores = pd.DataFrame(svc_scores)
# Display the first few rows of the DataFrame
svc_scores.head()
# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

Unnamed: 0,fit_time,score_time,test_score,train_score
0,1.446168,0.305284,0.977578,0.995512
1,1.260162,0.265013,0.986547,0.995792
2,1.40213,0.329017,0.975309,0.995794
3,1.429047,0.323612,0.977553,0.996915
4,1.367254,0.263134,0.978676,0.995794


In [190]:
t.test_5_12(svc_scores)

'Success'
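
The difference between the default and the binary representation used in `svc_pipe_binary` shows up only when a word repeats within a message. A sketch on one made-up message:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical message in which "win" occurs three times
toy_doc = ["win win win now"]

# Columns are sorted alphabetically: now, win
toy_count_row = CountVectorizer().fit_transform(toy_doc).toarray().tolist()
toy_binary_row = CountVectorizer(binary=True).fit_transform(toy_doc).toarray().tolist()

print(toy_count_row)   # occurrence counts
print(toy_binary_row)  # presence or absence only
```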

**Question 5.13** <br> {points: 1} 

What are the mean values of the columns in `svc_scores`? Save this in an object named `svc_scores_mean`.

In [191]:
svc_scores_mean = svc_scores.mean()

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer

In [192]:
t.test_5_13(svc_scores_mean)

'Success'

**Question 5.14** <br> {points: 1} 

Are you getting better results with `SVC` compared to `DummyClassifier`?

A) I am getting better results with `SVC`.

B) I am getting better results with `DummyClassifier`.


*Answer in the cell below using the uppercase letter associated with your answer. Place your answer between `""`, assign the correct answer to an object called `answer5_14`.*

In [None]:
answer5_14 = 'A'

# your code here
# raise NotImplementedError # No Answer - remove if you provide an answer
answer5_14

In [None]:
t.test_5_14(answer5_14)

## Attributions
- The Olympics Games DataSet - [Kaggle](https://www.kaggle.com/samruddhim/olympics-althlete-events-analysis)

- The SMS Spam Collection Dataset - [Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset) and [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection)

    *Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011*


Before submitting your assignment please do the following:

- Read through your solutions
- Go to File --> Save the notebook as --> PDF (or you can use HTML as well)
- Make sure that none of your code is broken 
- Verify that the tests from the questions you answered have obtained the output "Success" 