## Practical 4

This practical focuses on preparing data for a classification model and on assessing what the classifier has learned. Unlike the previous practicals, it has no tasks leading to questions for you to answer. Instead, it walks step by step through building and assessing a classifier.

*Important note:* Please ask if you cannot answer a task during the practicals fully or feel unsure about your answer (even after the explanation). You must develop the correct intuitions for each of the points we discuss here. Sometimes they do not ‘click’ by themselves; they require repeated practice and interpretation, and not every explanation works for everyone. We will be very happy to answer all your questions on Canvas!

### Setting up for Predictions

When acquiring data yourself, there is always a clear task or goal. However, when working with real-world data, it is often your goal to discover purpose in data. Moreover, it might also be the case that you already have a particular goal in mind but are still able to find surprising patterns with some thorough analysis. Collection, cleaning, and general preparation of data cover the majority of the effort in Data Science research.

Learning about common issues in real-world data and the ways to solve them takes practice and patience. Even when dealing with toy datasets, as you have already seen, the combination of your data and tools can be quite an obstacle on the way to your final goal: making predictions.

Many different issues can arise in the process described above. We will cover the following in this practical:
- Getting your data in the correct data types.
- Plotting to gain extra insight and identifying anomalies in data.
- Preprocessing to improve the information in your features.

In [2]:
import pandas as pd
df = pd.read_csv('../data/IMDB/imdb.csv', na_values='?')
df.replace({"quality": {"very-bad": 1, "bad": 2, "okay": 3, "good": 4, "very-good": 5}}, inplace=True)
df.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes,quality
0,Color,James-Cameron,723.0,178.0,0.0,855.0,Joel-David-Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000,4
1,Color,Gore-Verbinski,302.0,169.0,563.0,1000.0,Orlando-Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0,4
2,Color,Sam-Mendes,602.0,148.0,0.0,161.0,Rory-Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000,3
3,Color,Christopher-Nolan,813.0,164.0,22000.0,23000.0,Christian-Bale,27000.0,448130642.0,Action|Thriller,...,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000,5
4,,Doug-Walker,,,131.0,,Rob-Walker,131.0,,Documentary,...,,,,,,12.0,7.1,,0,4


### Preprocessing

Now that we have looked at some interesting characteristics of this particular dataset, we will discuss a few methods to enrich it and fix some of the errors we found. Again, please keep in mind that there are many more things that can be done. We will show a selection of them.

##### Task 1. Dealing with highly unique features 

As we have discussed in the Text-Mining-related video lectures, language is often represented as text vectors. However, what happens when we are dealing with actor names? These are not embedded in a piece of text, and it makes little sense to split them into separate tokens (the chance that a shared first or last name is an important feature is slim). However, a name does carry some information, intuitively at least. When you see names such as Ryan Gosling, Dave Bautista, Jared Leto, and Harrison Ford on one poster, you could make an educated guess that there needs to be something very wrong with the movie for it to get a low IMDB score. So, we associate some quality with certain names. A quick way to convey this kind of world knowledge is replacing the actor names with the average IMDB score of the movies they have starred in. Computing that average per actor is a straightforward operation in ``pandas``:

In [3]:
df.groupby('actor_1_name')['imdb_score'].mean().sort_values(ascending=False).head(10)

actor_1_name
Krystyna-Janda        9.1
Jack-Warden           8.9
Rob-McElhenney        8.8
Kimberley-Crossman    8.7
Abigail-Evans         8.7
Elina-Abai-Kyzy       8.7
Jackie-Gleason        8.7
Takashi-Shimura       8.7
Maria-Pia-Calzone     8.7
Ruth-Wilson           8.6
Name: imdb_score, dtype: float64
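
The cell above only computes the per-actor averages. A minimal sketch of the replacement step itself, on made-up toy data (the column name `actor_1_mean_score` is a hypothetical choice, not part of the dataset), could look like this:

```python
import pandas as pd

# Toy data: the column names mirror the dataset, the values are made up.
toy = pd.DataFrame({
    "actor_1_name": ["A", "A", "B", "C", "C", "C"],
    "imdb_score":   [8.0, 6.0, 7.0, 5.0, 6.0, 7.0],
})

# transform('mean') broadcasts each group's mean back onto its own rows,
# so the result lines up with the original DataFrame index.
toy["actor_1_mean_score"] = (
    toy.groupby("actor_1_name")["imdb_score"].transform("mean")
)
print(toy)
```

Keep in mind that a feature built from the target score carries information about the label, so in a real experiment it would be safer to compute these averages on the training portion only.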

However, the mean score might not be enough: an actor with a single highly rated movie can top the list. We would also want to introduce some measure of reliability. The approach used here is not the only one, but it at least takes into account that (a) popularity is important (e.g., starring in many movies), and (b) the scores of those movies are important. Note that this metric can be skewed toward actors who are both old AND popular (see most of the actors listed below). However, by summing the z-scores of the movies an actor starred in, we get some indication of how consistently good an actor is.

In [4]:
from scipy.stats.mstats import zscore

df['imdb_z'] = zscore(df['imdb_score'])
pd.Series(df.groupby('actor_1_name')['imdb_z'].sum()).sort_values(ascending=False).head(10)

actor_1_name
Tom-Hanks                 20.967048
Leonardo-DiCaprio         19.656199
Harrison-Ford             17.215400
Denzel-Washington         17.105197
Christian-Bale            15.497207
Matt-Damon                14.440725
Philip-Seymour-Hoffman    14.186358
Kevin-Spacey              13.846442
Robert-De-Niro            13.298695
Tom-Cruise                13.025713
Name: imdb_z, dtype: float64

Note that when we do not consider the number of movies, we get very different results:

In [5]:
pd.Series(df.groupby('actor_1_name')['imdb_z'].mean()).sort_values(ascending=False).head(10)

actor_1_name
Krystyna-Janda        2.361291
Jack-Warden           2.183659
Rob-McElhenney        2.094843
Takashi-Shimura       2.006028
Kimberley-Crossman    2.006028
Jackie-Gleason        2.006028
Abigail-Evans         2.006028
Elina-Abai-Kyzy       2.006028
Maria-Pia-Calzone     2.006028
Bunta-Sugawara        1.917212
Name: imdb_z, dtype: float64
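
To see why the two rankings differ, here is a small sketch on invented data (the actor labels and scores are purely illustrative). The z-score is computed by hand with `ddof=0`, which is also the default of SciPy's `zscore`:

```python
import pandas as pd

# Invented scores: one actor with four good movies, one with a single
# excellent movie, and a few weaker movies by other actors.
toy = pd.DataFrame({
    "actor": ["veteran"] * 4 + ["newcomer"] + ["other"] * 4,
    "score": [8.0, 7.5, 8.5, 8.0, 9.0, 5.0, 5.5, 6.0, 4.5],
})

# Population z-score (ddof=0) over all movies.
toy["z"] = (toy["score"] - toy["score"].mean()) / toy["score"].std(ddof=0)

# Sum rewards many good movies; mean rewards a single outstanding one.
per_actor = toy.groupby("actor")["z"].agg(["sum", "mean", "count"])
print(per_actor)
```

With these numbers the veteran leads on the summed z-score, while the newcomer leads on the mean.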

##### Task 2. Dealing with text data

What to do when we have text data? While we could set up a prediction task for the movie score based on the plot alone, this dataset does not provide the rich textual representation required for a proper computational linguistics approach. Instead, we will take the keywords from the plot:

In [6]:
df['plot_keywords'].head(5)

0               avatar|future|marine|native|paraplegic
1    goddess|marriage-ceremony|marriage-proposal|pi...
2                  bomb|espionage|sequel|spy|terrorist
3    deception|imprisonment|lawlessness|police-offi...
4                                                  NaN
Name: plot_keywords, dtype: object

We will have to do two things here: split the different keywords and then convert them into word counts. Scikit-learn offers a ``CountVectorizer`` to do just this. Its default tokenizer treats the character | as a token boundary out of the box (along with other punctuation, such as hyphens). We only have to make sure that the missing values (NaN) are replaced by a blank string:

In [7]:
df['plot_keywords'] = df['plot_keywords'].fillna(' ')

Note that the transformation returns a sparse matrix. If, afterward, we want to view it in a ``DataFrame`` format again, it needs to be a dense matrix (or we have to do some special ``pandas`` loading procedure for sparse matrices). To restrict the words to frequent ones only, we can use ``max_features``. Afterward, you can join this word matrix to your original ``DataFrame`` if you prefer, or even do something fancier like using the ``TfidfVectorizer``. However, to keep further interpretation in this practical straightforward, we will not be doing this.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
# Convert a collection of text documents to a matrix of token counts.

cv = CountVectorizer(max_features=500)
pd.DataFrame(cv.fit_transform(df['plot_keywords']).todense(),
             columns=sorted(cv.vocabulary_))

Unnamed: 0,1950s,1960s,1970s,1980s,1990s,19th,abuse,accident,action,actor,...,woods,word,worker,world,writer,written,year,york,young,zombie
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5024,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5025,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5026,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5027,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
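
The default token pattern splits on every non-word character, so a keyword such as `marriage-ceremony` ends up as two separate columns. If whole keywords are preferred, one option is a custom tokenizer. The sketch below uses made-up documents, and `split_on_pipe` is a hypothetical helper, not something the practical defines:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["avatar|future|marine", "marriage-ceremony|marriage-proposal", ""]

def split_on_pipe(text):
    # Hypothetical helper: one token per keyword, empty strings dropped.
    return [tok for tok in text.split("|") if tok]

# Default behaviour: hyphenated keywords are broken into separate words.
default_cv = CountVectorizer()
default_cv.fit(docs)

# Custom tokenizer: only '|' separates tokens (token_pattern is unused).
pipe_cv = CountVectorizer(tokenizer=split_on_pipe, token_pattern=None)
pipe_cv.fit(docs)

print(sorted(default_cv.vocabulary_))
print(sorted(pipe_cv.vocabulary_))
```

Which variant works better depends on the data: whole keywords are more specific, but they are also rarer.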


##### Task 3. Dealing with missing values

We have discussed missing values a few times, and so far, we have just replaced them with some standard value such as zero and the empty string, but how do we correctly deal with missing values in data mining settings?

In [9]:
df['gross'].head(5)

0    760505847.0
1    309404152.0
2    200074175.0
3    448130642.0
4            NaN
Name: gross, dtype: float64

First, it is important to make a few considerations:

- Does it make sense to fill in the missing values? In the case of text, what are you going to fill it with? Is it easier to fix the problem or ignore the document altogether?

- Are the missing values random? If it is just some small error, it will not affect your analysis if you delete the entries with missing values. Of course, that is supposing you have enough data.

- If the missing values are not random, you can choose a few strategies to fix them without hurting the overall vector representation too much:

    - Impute: replaces the missing values with some value inferred from the data (e.g., the mean/median feature value or the majority category).
    - Estimate: replaces the missing values with some value learned from the data (e.g., using Singular Value Decomposition, k-NN, or Naive Bayes).
    
Estimation often proves effective when doing predictions afterward, even with up to half of the values missing. However, using parameterized models also introduces more complexity in your pipeline, so it is worthwhile to consider if time is a factor and if there are enough resources (in terms of data).
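
As an illustration of the "estimate" strategy, the sketch below uses scikit-learn's `KNNImputer` on a tiny made-up matrix. The data and the choice of `n_neighbors=2` are assumptions for demonstration only:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up matrix: the second feature is missing in the third row.
X_toy = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [4.0, 40.0],
])

# Each missing entry is replaced by the mean of that feature over the
# n_neighbors rows that are closest on the features both rows have.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X_toy)
print(X_filled)
```

Here the missing value is estimated from the rows whose first feature is closest to 3.0, rather than from the overall column mean.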

##### Task 4. Imputing the missing values

Doing this process column-wise in ``pandas`` requires a bit of effort to make it compatible with scikit-learn’s ``SimpleImputer``. Scikit-learn expects a matrix, so if we only want to impute one column, we need to select it with double brackets (``df[['gross']]``), which gives a one-column ``DataFrame`` instead of a ``Series``:

In [10]:
from sklearn.impute import SimpleImputer

imp = SimpleImputer(strategy='mean')
df[['gross']] = imp.fit_transform(df[['gross']])

As we can see, it’s done:

In [11]:
df['gross'].head(5)

0    7.605058e+08
1    3.094042e+08
2    2.000742e+08
3    4.481306e+08
4    4.854946e+07
Name: gross, dtype: float64

A simpler alternative is to use the native ``pandas`` method for filling missing values:

In [12]:
df['gross'] = df['gross'].fillna(df['gross'].mean())
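
The mean is sensitive to a few very large values, which revenue-like features often have. The following sketch, on invented numbers, compares mean and median imputation. Whether the median is the better choice for this dataset is not something the practical establishes:

```python
import numpy as np
import pandas as pd

# Invented revenue-like values with one very large entry and one gap.
gross = pd.Series([1e6, 2e6, 3e6, 500e6, np.nan])

mean_filled = gross.fillna(gross.mean())      # pulled up by the outlier
median_filled = gross.fillna(gross.median())  # stays near the typical value

print(mean_filled.iloc[-1], median_filled.iloc[-1])
```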

##### Task 5. Dealing with discrete (categorical) data

Unfortunately, most scikit-learn estimators do not accept categorical (string-valued) columns directly. Notice that other machine learning libraries (e.g., Weka) do support categorical data natively. If we want to use a scikit-learn model, the entire dataframe needs to be numeric and free of missing values. If we want to impute with the most frequent category, we have to count the occurrences of each category (``value_counts`` already sorts them by frequency), take the top entry, and get its index value.

In [13]:
df['genres'].value_counts()

Drama                                       235
Comedy                                      209
Comedy|Drama                                189
Comedy|Drama|Romance                        186
Comedy|Romance                              158
                                           ... 
Biography|Documentary|Drama                   1
Action|Adventure|Animation|Comedy|Sci-Fi      1
Action|Drama|Sport|Thriller                   1
Crime|Documentary|Drama                       1
Adventure|Sci-Fi                              1
Name: genres, Length: 914, dtype: int64

In [14]:
df['genres'].value_counts()[:1].index[0]

'Drama'

To impute numerical and categorical features, we need a small piece of code. The ``pandas`` library has a ’category’ type that allows for easy conversion to numbers. We will use that in combination with the two ``fillna`` lines that we discussed before to fill the missing values of all features:

In [15]:
df.info() # checking the types of columns and missing values 
# observe the object dtype and that the count of non-null (non-missing) is not the same for all columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5029 entries, 0 to 5028
Data columns (total 30 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   color                      5010 non-null   object 
 1   director_name              4926 non-null   object 
 2   num_critic_for_reviews     4981 non-null   float64
 3   duration                   5014 non-null   float64
 4   director_facebook_likes    4926 non-null   float64
 5   actor_3_facebook_likes     5006 non-null   float64
 6   actor_2_name               5016 non-null   object 
 7   actor_1_facebook_likes     5022 non-null   float64
 8   gross                      5029 non-null   float64
 9   genres                     5029 non-null   object 
 10  actor_1_name               5022 non-null   object 
 11  movie_title                5029 non-null   object 
 12  num_voted_users            5029 non-null   int64  
 13  cast_total_facebook_likes  5029 non-null   int64

In [16]:
for column, dtype in df.dtypes.to_dict().items(): # for each column, dtype pair in the dataframe
    
    if dtype == 'object': # if the column is an object (thus discrete)
        
        df[column] = df[column].fillna(df[column].value_counts()[:1].index[0]) # fill with most common
        cats = df[column].astype('category') # convert to category
        df[column] = cats.cat.codes # use the category indices to convert to numeric
        
    else: # if the column is something else (thus numeric)
        df[column] = df[column].fillna(df[column].mean()) # take the mean

In [17]:
df.info() # checking the changes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5029 entries, 0 to 5028
Data columns (total 30 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   color                      5029 non-null   int8   
 1   director_name              5029 non-null   int16  
 2   num_critic_for_reviews     5029 non-null   float64
 3   duration                   5029 non-null   float64
 4   director_facebook_likes    5029 non-null   float64
 5   actor_3_facebook_likes     5029 non-null   float64
 6   actor_2_name               5029 non-null   int16  
 7   actor_1_facebook_likes     5029 non-null   float64
 8   gross                      5029 non-null   float64
 9   genres                     5029 non-null   int16  
 10  actor_1_name               5029 non-null   int16  
 11  movie_title                5029 non-null   int16  
 12  num_voted_users            5029 non-null   int64  
 13  cast_total_facebook_likes  5029 non-null   int64
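
Category codes are compact, but they impose an arbitrary numeric order on the categories. A common alternative is one-hot encoding. The sketch below contrasts the two on a made-up frame; it is an illustration, not the approach this practical uses:

```python
import pandas as pd

toy = pd.DataFrame({
    "color": ["Color", "Black-and-White", "Color"],
    "duration": [120.0, 95.0, 110.0],
})

# Integer codes: categories are sorted alphabetically and numbered 0, 1, ...
codes = toy["color"].astype("category").cat.codes

# One-hot encoding: one indicator column per category, no implied order.
one_hot = pd.get_dummies(toy, columns=["color"])

print(codes.tolist())
print(one_hot.columns.tolist())
```

One-hot encoding becomes impractical for columns with thousands of distinct values, such as actor names, which is one reason to prefer a more compact representation there.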

##### Task 6. Standardizing numerical features

With all the missing values out of the way, we need to standardize our feature space. Many models work better when all features are on a comparable scale, so we use scikit-learn’s ``StandardScaler`` to rescale each feature to zero mean and unit variance. Note that this does not change the shape of a feature's distribution. While doing that, let us define the input and output data.

In [18]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
y = df.pop('quality') # returns and drops the column from the dataset
X = scaler.fit_transform(df)

Are we finally ready to make predictions now? Not quite yet.
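
A quick check of what `StandardScaler` does to a single feature, using synthetic skewed data. The `skewness` helper is defined here for illustration only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=(1000, 1))  # right-skewed feature

scaled = StandardScaler().fit_transform(skewed)

def skewness(a):
    # Illustrative helper: third standardized moment of the values.
    a = a.ravel()
    return np.mean(((a - a.mean()) / a.std()) ** 3)

print(scaled.mean(), scaled.std())
print(skewness(skewed), skewness(scaled))
```

The mean becomes 0 and the standard deviation 1, while the skewness stays the same: the feature is shifted and rescaled, not reshaped.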

##### Task 7. Splitting the available data into train and test

Our dataframe is ready, and we can use it to build a classification model. However, before building our selected model, we first need to set up baselines and inspect their behavior. Let us start by at least making our train/test evaluation set-up:

In [19]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
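
One possible refinement, shown on synthetic data: split first, then fit the scaler on the training part only, so that no statistics from the test set reach the model. The `stratify` and `random_state` arguments and all variable names below are illustrative choices, not part of the practical's code:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_raw = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_toy = np.array([0] * 150 + [1] * 50)  # imbalanced toy labels

# stratify keeps the class proportions similar in both parts;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

# Fit the scaler on the training data, then apply it to both parts.
toy_scaler = StandardScaler().fit(X_tr)
X_tr_s = toy_scaler.transform(X_tr)
X_te_s = toy_scaler.transform(X_te)
```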

##### Task 8. Building some baseline models for comparison purposes

Since we are not doing any parameter tuning, we do not require a validation set yet. Simple classification models such as Naive Bayes or Logistic Regression (without polynomial features, kernel transformations, or parameter tuning) make good first baselines.

And, of course, we would also need a dummy majority baseline or zero-rule classifier. We can construct this simple model as follows:

In [20]:
y_baseline = [y.value_counts()[:1].index[0]] * len(y)

This is the same ’most common class’ lookup that we used for the imputation of categories. We then create a list as long as the label vector, filled with this value. That is our majority baseline. We can quickly evaluate its performance:

In [21]:
from sklearn.metrics import classification_report
print(classification_report(y, y_baseline))

              precision    recall  f1-score   support

           1       0.00      0.00      0.00        41
           2       0.00      0.00      0.00       441
           3       0.55      1.00      0.71      2774
           4       0.00      0.00      0.00      1445
           5       0.00      0.00      0.00       328

    accuracy                           0.55      5029
   macro avg       0.11      0.20      0.14      5029
weighted avg       0.30      0.55      0.39      5029
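
scikit-learn also ships a ready-made majority baseline, `DummyClassifier`. The sketch below uses made-up labels. Passing `zero_division=0` to `classification_report` sets the undefined precision of never-predicted classes to 0 without raising warnings:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report

y_toy = np.array([3, 3, 3, 4, 4, 2])
X_dummy = np.zeros((len(y_toy), 1))  # the baseline ignores the features

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_toy)
pred = baseline.predict(X_dummy)

acc = accuracy_score(y_toy, pred)
print(acc)
print(classification_report(y_toy, pred, zero_division=0))
```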





##### Task 9. Building and evaluating the selected classification model

Now for the selected classification model, let us use LogisticRegression:

In [24]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
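
The warning above means the solver stopped before converging. One way to address it, as the message itself suggests, is to raise `max_iter`. The sketch below uses synthetic data, and the value 1000 is an arbitrary illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic three-class problem, for illustration only.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_toy, y_toy)

# n_iter_ reports how many iterations the solver actually used.
print(clf.n_iter_)
```

If `n_iter_` stays below `max_iter`, the solver converged and the warning does not appear.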


We can evaluate what the model has learned as follows:

In [25]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           1       1.00      0.45      0.62        11
           2       0.93      0.93      0.93       100
           3       0.99      1.00      0.99       717
           4       0.99      0.98      0.99       351
           5       0.95      0.96      0.96        79

    accuracy                           0.98      1258
   macro avg       0.97      0.87      0.90      1258
weighted avg       0.98      0.98      0.98      1258



Well, aren’t we doing amazing? This is the point where some loud alarm bells need to start ringing. Why?

- We determined that most of the features are useless; they do not accurately reflect the ’truth’ or contain mixed information.
- The amount of information we have available is quite limited and would intuitively not be enough to guess movie quality this well (although, of course, this is speculation at this point).
- We ran a vanilla model, no tuning, no nothing, and we are doing almost perfect on the test set.

Let us see what the model is paying attention to:

In [26]:
list(zip(df.columns, lr.coef_[2]))  # one coefficient per column of X, so pair with all columns

[('color', -0.03066004905701737),
 ('director_name', 0.026970860139951768),
 ('num_critic_for_reviews', 0.13420271657544114),
 ('duration', 0.08215123980305461),
 ('director_facebook_likes', -0.0980873752415476),
 ('actor_3_facebook_likes', -0.09986619345211814),
 ('actor_2_name', 0.13982793492831092),
 ('actor_1_facebook_likes', -0.10832565921943706),
 ('gross', 0.06522750609433892),
 ('genres', -0.004179990908756446),
 ('actor_1_name', -0.05203343149257484),
 ('movie_title', -0.01119589722618288),
 ('num_voted_users', -0.20024115290581107),
 ('cast_total_facebook_likes', 0.33155800351316783),
 ('actor_3_name', -0.1846632220427074),
 ('facenumber_in_poster', 0.10024541236345444),
 ('plot_keywords', -0.08784301647318947),
 ('movie_imdb_link', -0.14805164570272275),
 ('num_user_for_reviews', -0.09943063504597484),
 ('language', 0.012595002030817782),
 ('country', 0.132291786264631),
 ('content_rating', 0.0949612858598149),
 ('budget', -0.051837469476016126),
 ('title_year', 0.3219000961

In [35]:
len(df.columns)

28

``LogisticRegression`` has coefficients per class, so this would be for class ’3’. Apparently 'imdb_score' is an amazing predictor for quality. Let’s see why:

In [27]:
plot = pd.concat([df['imdb_score'], y], axis=1).plot(kind='scatter', x='imdb_score', y='quality')

##### Task 10. Correcting the data and re-building the models

- What is happening here?

Hopefully, the plot makes clear that quality is derived directly from imdb_score: scores from 1 to 3 map to quality 1, 3 to 5 to quality 2, 5 to 7 to quality 3, 7 to 8 to quality 4, and 8 to 10 to quality 5. The model picks up on this and relies on the feature. Because quality is just a binned version of imdb_score, we do not want to use imdb_score as a feature: it is a case of feature contamination/pollution (also known as target leakage).
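
The binning described above can be sketched with `pd.cut`. The bin edges are taken from the text. Which side of each edge is inclusive is an assumption, since the practical does not state it:

```python
import pandas as pd

# Scores of the first four movies shown in the head of the dataset.
scores = pd.Series([7.9, 7.1, 6.8, 8.5])

# Assumption: right-inclusive bins (1, 3], (3, 5], (5, 7], (7, 8], (8, 10].
quality_toy = pd.cut(
    scores, bins=[1, 3, 5, 7, 8, 10], labels=[1, 2, 3, 4, 5]
).astype(int)

print(quality_toy.tolist())
```

These four scores do not fall on a bin edge, so they map to the same quality labels as in the dataset (4, 4, 3, 5) under either convention.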

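- Try removing the polluting feature with ``del df['imdb_score']`` or ``df.drop('imdb_score', inplace=True, axis=1)`` and run the experiments again. You should get lower scores, and the coefficients should make more sense. Keep in mind that ``imdb_z`` from Task 1 is also computed from ``imdb_score``; as long as it stays in the dataframe, it leaks the same information and the scores will remain high.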

In [28]:
df.drop('imdb_score', inplace=True, axis=1)

X = scaler.fit_transform(df)

X_train, X_test, y_train, y_test = train_test_split(X, y)

lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

print(classification_report(y_test, y_pred))

list(zip(df.columns, lr.coef_[2]))  # one coefficient per column of X, so pair with all columns

              precision    recall  f1-score   support

           1       1.00      0.71      0.83         7
           2       0.97      0.91      0.94       107
           3       0.98      0.99      0.99       664
           4       0.92      0.98      0.95       378
           5       0.96      0.72      0.82       102

    accuracy                           0.96      1258
   macro avg       0.97      0.86      0.91      1258
weighted avg       0.96      0.96      0.96      1258



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


[('color', -0.01857881925191206),
 ('director_name', 0.022551170939812726),
 ('num_critic_for_reviews', 0.368592399937743),
 ('duration', 0.16837668861346128),
 ('director_facebook_likes', -0.12942345220970555),
 ('actor_3_facebook_likes', -0.04549951365487586),
 ('actor_2_name', 0.02349647515062821),
 ('actor_1_facebook_likes', -0.005669297880670421),
 ('gross', 0.0784140181192109),
 ('genres', 0.011284743537481504),
 ('actor_1_name', -0.03836222301311909),
 ('movie_title', 0.07375055043032601),
 ('num_voted_users', -0.31841922947801204),
 ('cast_total_facebook_likes', 0.2991329616692036),
 ('actor_3_name', -0.12557352990200782),
 ('facenumber_in_poster', 0.07787744457260234),
 ('plot_keywords', -0.05580220449669781),
 ('movie_imdb_link', -0.18306920089786224),
 ('num_user_for_reviews', -0.23959120424226857),
 ('language', 0.033415672707136974),
 ('country', 0.12463328627350902),
 ('content_rating', 0.08307780070286908),
 ('budget', -0.1381310979070767),
 ('title_year', 0.242421802360
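
The scores are still very high after dropping `imdb_score`. One likely reason is that `imdb_z`, created in Task 1, is a rescaled copy of the same score and is still among the features. The sketch below uses synthetic data (not the IMDB file) to show a simple screening step: correlate every feature with the target and look for suspiciously strong ones:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "imdb_score": rng.uniform(1.0, 10.0, size=500),
    "noise": rng.normal(size=500),
})
# A z-scored copy is a linear transformation of the original score.
toy["imdb_z"] = (
    (toy["imdb_score"] - toy["imdb_score"].mean())
    / toy["imdb_score"].std(ddof=0)
)

# Target derived from the score by binning (bin edges as in the text).
target = pd.cut(
    toy["imdb_score"], bins=[1, 3, 5, 7, 8, 10],
    labels=[1, 2, 3, 4, 5], include_lowest=True,
).astype(int)

corr_with_target = toy.corrwith(target).abs().sort_values(ascending=False)
print(corr_with_target)
```

The copy correlates with the target exactly as strongly as the original score, so dropping only one of the two does not remove the leak.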

##### Other performance measures

You can also measure performance by calling individual functions. For more information, see the links below:

*Accuracy*:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

In [29]:
from sklearn.metrics import accuracy_score

acc = accuracy_score(y_test, y_pred)  # scikit-learn expects (y_true, y_pred)
print('accuracy: ' + str(acc))

accuracy: 0.958664546899841


*Precision*:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html

In [30]:
from sklearn.metrics import precision_score
# scikit-learn expects (y_true, y_pred); swapping them changes the result
precision = precision_score(y_test, y_pred, average='weighted')
# should agree with the weighted avg precision in the classification report above (0.96)
print('average precision: ' + str(precision))


*Recall*:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html

In [31]:
from sklearn.metrics import recall_score
recall = recall_score(y_test, y_pred, average='weighted')  # (y_true, y_pred)
print('average recall: ' + str(recall))

average recall: 0.958664546899841


*F1 measure*:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

In [32]:
from sklearn.metrics import f1_score
f1 = f1_score(y_test, y_pred, average='weighted')  # use f1_score (not recall_score) with (y_true, y_pred)
# should agree with the weighted avg f1-score in the classification report above (0.96)
print('average f1: ' + str(f1))
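
The order of the arguments matters for precision and recall: scikit-learn's metric functions take the true labels first and the predictions second. A small sketch with made-up binary labels shows what happens when the two are swapped:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 2 true positives, 3 false positives, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 1, 1, 1, 0, 0]

p_correct = precision_score(y_true, y_hat)  # TP / (TP + FP)
p_swapped = precision_score(y_hat, y_true)  # roles reversed
r_correct = recall_score(y_true, y_hat)     # TP / (TP + FN)

print(p_correct, p_swapped, r_correct)
```

With swapped arguments, the precision call returns the recall. Accuracy is unaffected because it is symmetric in its two arguments.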
