# Text Classification Assessment

This assessment is a text classification project where the goal is to classify the genre of a movie based on its characteristics, primarily the text of the plot summary. You have a training set that you will use to build and select your best predictive model. You will then use that model to predict the classes of the test set. We will compare the performance of your predictions to your classmates' using the F1 score. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

The **movie_train.csv** dataset contains information (`Release Year`, `Title`, `Plot`, `Director`, `Cast`) about 10,682 movies and the label of `Genre`. There are 9 different genres in this data set, so this is a multiclass problem. You are expected to primarily use the plot column, but can use the additional columns as you see fit.

After you have identified your best performing model, you will create predictions for the test set. The test set contains 3,561 movies with all of their information except the `Genre`.

Below is a list of tasks that you will definitely want to complete for this challenge, but this list is not exhaustive. It does not include any tasks for handling class imbalance, comparing multiple models, or tuning hyperparameters, but you should still try those to see whether they help you create a better predictive model.

**Deliverables:**
For this project you will need to create two things: your predictions on the holdout set and a notebook detailing your process.


# Good Luck

### Task #1: Perform imports and load the dataset into a pandas DataFrame


In [1]:
import pandas as pd
import numpy as np

import string

import nltk
from nltk.corpus import stopwords

In [None]:
nltk.download("stopwords")

In [2]:
df = pd.read_csv('movie_train.csv', index_col=0)

In [3]:
df.head()

Unnamed: 0,Release Year,Title,Plot,Director,Cast,Genre
10281,1984,Silent Madness,A computer error leads to the accidental relea...,Simon Nuchtern,"Belinda Montgomery, Viveca Lindfors",horror
7341,1960,Desire in the Dust,"Lonnie Wilson (Ken Scott), the son of a sharec...",Robert L. Lippert,"Raymond Burr, Martha Hyer, Joan Bennett",drama
10587,1986,On the Edge,"A gaunt, bushy-bearded, 44-year-old Wes Holman...",Rob Nilsson,"Bruce Dern, Pam Grier",drama
25495,1988,Ram-Avtar,Ram and Avtar are both childhood best friends....,Sunil Hingorani,"Sunny Deol, Anil Kapoor, Sridevi",drama
16607,2013,Machete Kills,Machete Cortez (Danny Trejo) and Sartana River...,Robert Rodriguez,"Danny Trejo, Michelle Rodriguez, Sofía Vergara...",action


### Task #2: Check for missing values:

In [4]:
df.info()

# Cast has 169 missing values (10,513 non-null out of 10,682 rows); the other columns have none

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10682 entries, 10281 to 3583
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Release Year  10682 non-null  int64 
 1   Title         10682 non-null  object
 2   Plot          10682 non-null  object
 3   Director      10682 non-null  object
 4   Cast          10513 non-null  object
 5   Genre         10682 non-null  object
dtypes: int64(1), object(5)
memory usage: 584.2+ KB
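
The `Non-Null Count` column above shows 10,513 entries for `Cast` against 10,682 rows, so `Cast` is the one column with gaps. A more direct way to read this off is `isna().sum()`. The sketch below uses a tiny stand-in frame with made-up rows (the name `toy` is only for illustration), not the real training data.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the training data; the rows are made up for illustration.
toy = pd.DataFrame({
    "Plot": ["A gunman takes on a land baron.", "Two friends fall out.", "A ship sinks."],
    "Cast": ["Jim Davis, Lyn Thomas", np.nan, np.nan],
    "Genre": ["western", "drama", "adventure"],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Names of the columns that contain at least one missing value.
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()
print(columns_with_missing)
```

On the real data the same call would be `df.isna().sum()`.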


### Task #3: Remove NaN values:

In [5]:
df['Plot Len'] = df['Plot'].str.len()

In [6]:
df.sort_values('Plot Len')

Unnamed: 0,Release Year,Title,Plot,Director,Cast,Genre,Plot Len
24876,1973,Chhalia,Chhalia is a family thriller.,Mukul Dutt,"Navin Nischol, Nanda, Shatrughan Sinha",action,29
6210,1954,They Were So Young,"""A beach near Rio de Janeiro"".",Kurt Neumann,"Raymond Burr, Scott Brady, Johanna Matz",drama,30
7382,1960,Noose for a Gunman,A gunman takes on a corrupt land baron.,Edward L. Cahn,"Jim Davis, Lyn Thomas",western,39
4075,1945,A Medal for Benny,The film examines small town hypocrisy.,Irving Pichel,"Dorothy Lamour, Arturo de Córdova",drama,39
25919,1997,Suraj,Suraj is an Action film for Mithun Fans.,T.L.V. Prasad,"Rakesh Bedi, Mithun Chakraborty, Puneet Issar",action,40
...,...,...,...,...,...,...,...
31250,2014,Anjaan,A handicapped man named Krishna (Suriya) arriv...,Lingusamy,"Suriya, Samantha, Vidyut Jamwal, Manoj Bajpai",action,14242
3592,1943,Isle of Forgotten Sins,Somewhere on one of the English-speaking South...,Edgar G. Ulmer,"Gale Sondergaard, John Carradine",adventure,15046
3009,1941,Broadway Limited,"Following the screening of her latest film ""Th...",Gordon Douglas,"Victor McLaglen, Patsy Kelly, ZaSu Pitts",comedy,16517
23223,1987,Sworn Brothers,"When Lam Ting-yat was little, his father died ...",David Lai,"Andy Lau, Cheung Kwok Keung",crime,16636


In [7]:
df.groupby('Genre').count()

Genre,Release Year,Title,Plot,Director,Cast,Plot Len
action,830,830,830,830,823,830
adventure,331,331,331,331,329,331
comedy,2724,2724,2724,2724,2703,2724
crime,328,328,328,328,326,328
drama,3770,3770,3770,3770,3673,3770
horror,840,840,840,840,810,840
romance,649,649,649,649,644,649
thriller,685,685,685,685,680,685
western,525,525,525,525,525,525
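
The cells in this task inspect plot lengths and genre counts but do not yet remove or fill any missing values. Since only `Cast` has gaps and `Plot` is the main feature, dropping those rows would throw away usable plots. One option, shown as a sketch on made-up rows, is to fill the missing cast with an empty string; dropping is shown for comparison. Which choice is better for the model is an open question, not something the source settles.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the training data.
toy = pd.DataFrame({
    "Plot": ["A gunman takes on a land baron.", "Two friends fall out.", "A ship sinks."],
    "Cast": ["Jim Davis, Lyn Thomas", np.nan, np.nan],
    "Genre": ["western", "drama", "adventure"],
})

# Option 1: keep every row and replace a missing cast with an empty string.
filled = toy.copy()
filled["Cast"] = filled["Cast"].fillna("")

# Option 2: drop the rows whose cast is missing (loses their plots too).
dropped = toy.dropna(subset=["Cast"])

print(len(filled), filled["Cast"].isna().sum())
print(len(dropped))
```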


### Task #4: Take a look at the columns and do some EDA to familiarize yourself with the data. This will consist of cleaning up the data set by doing things like removing stop words, tokenizing, and/or lemmatizing words.

#### Replace Hyphens with Spaces:

In [8]:
# A vectorized replace avoids the chained assignment (df['Plot'][i] = ...)
# that triggers pandas' SettingWithCopyWarning.
df['Plot'] = df['Plot'].str.replace("-", " ", regex=False)


In [9]:
df.head()

Unnamed: 0,Release Year,Title,Plot,Director,Cast,Genre,Plot Len
10281,1984,Silent Madness,A computer error leads to the accidental relea...,Simon Nuchtern,"Belinda Montgomery, Viveca Lindfors",horror,175
7341,1960,Desire in the Dust,"Lonnie Wilson (Ken Scott), the son of a sharec...",Robert L. Lippert,"Raymond Burr, Martha Hyer, Joan Bennett",drama,1374
10587,1986,On the Edge,"A gaunt, bushy bearded, 44 year old Wes Holman...",Rob Nilsson,"Bruce Dern, Pam Grier",drama,839
25495,1988,Ram-Avtar,Ram and Avtar are both childhood best friends....,Sunil Hingorani,"Sunny Deol, Anil Kapoor, Sridevi",drama,857
16607,2013,Machete Kills,Machete Cortez (Danny Trejo) and Sartana River...,Robert Rodriguez,"Danny Trejo, Michelle Rodriguez, Sofía Vergara...",action,4428


#### Tokenize:

In [10]:
# Split each plot on whitespace into a list of tokens.
df['Plot'] = df['Plot'].str.split()


In [11]:
df.head()

Unnamed: 0,Release Year,Title,Plot,Director,Cast,Genre,Plot Len
10281,1984,Silent Madness,"[A, computer, error, leads, to, the, accidenta...",Simon Nuchtern,"Belinda Montgomery, Viveca Lindfors",horror,175
7341,1960,Desire in the Dust,"[Lonnie, Wilson, (Ken, Scott),, the, son, of, ...",Robert L. Lippert,"Raymond Burr, Martha Hyer, Joan Bennett",drama,1374
10587,1986,On the Edge,"[A, gaunt,, bushy, bearded,, 44, year, old, We...",Rob Nilsson,"Bruce Dern, Pam Grier",drama,839
25495,1988,Ram-Avtar,"[Ram, and, Avtar, are, both, childhood, best, ...",Sunil Hingorani,"Sunny Deol, Anil Kapoor, Sridevi",drama,857
16607,2013,Machete Kills,"[Machete, Cortez, (Danny, Trejo), and, Sartana...",Robert Rodriguez,"Danny Trejo, Michelle Rodriguez, Sofía Vergara...",action,4428


#### Remove Capitals:

In [12]:
# Lowercase every token in each plot.
df['Plot'] = df['Plot'].apply(lambda words: [word.lower() for word in words])


In [13]:
df.head()

Unnamed: 0,Release Year,Title,Plot,Director,Cast,Genre,Plot Len
10281,1984,Silent Madness,"[a, computer, error, leads, to, the, accidenta...",Simon Nuchtern,"Belinda Montgomery, Viveca Lindfors",horror,175
7341,1960,Desire in the Dust,"[lonnie, wilson, (ken, scott),, the, son, of, ...",Robert L. Lippert,"Raymond Burr, Martha Hyer, Joan Bennett",drama,1374
10587,1986,On the Edge,"[a, gaunt,, bushy, bearded,, 44, year, old, we...",Rob Nilsson,"Bruce Dern, Pam Grier",drama,839
25495,1988,Ram-Avtar,"[ram, and, avtar, are, both, childhood, best, ...",Sunil Hingorani,"Sunny Deol, Anil Kapoor, Sridevi",drama,857
16607,2013,Machete Kills,"[machete, cortez, (danny, trejo), and, sartana...",Robert Rodriguez,"Danny Trejo, Michelle Rodriguez, Sofía Vergara...",action,4428


#### Remove Punctuation:

In [14]:
# Build the translation table once, then strip punctuation from every token.
punct_table = str.maketrans('', '', string.punctuation)
df['Plot'] = df['Plot'].apply(lambda words: [word.translate(punct_table) for word in words])


In [15]:
df.head()

Unnamed: 0,Release Year,Title,Plot,Director,Cast,Genre,Plot Len
10281,1984,Silent Madness,"[a, computer, error, leads, to, the, accidenta...",Simon Nuchtern,"Belinda Montgomery, Viveca Lindfors",horror,175
7341,1960,Desire in the Dust,"[lonnie, wilson, ken, scott, the, son, of, a, ...",Robert L. Lippert,"Raymond Burr, Martha Hyer, Joan Bennett",drama,1374
10587,1986,On the Edge,"[a, gaunt, bushy, bearded, 44, year, old, wes,...",Rob Nilsson,"Bruce Dern, Pam Grier",drama,839
25495,1988,Ram-Avtar,"[ram, and, avtar, are, both, childhood, best, ...",Sunil Hingorani,"Sunny Deol, Anil Kapoor, Sridevi",drama,857
16607,2013,Machete Kills,"[machete, cortez, danny, trejo, and, sartana, ...",Robert Rodriguez,"Danny Trejo, Michelle Rodriguez, Sofía Vergara...",action,4428


#### Stopword Removal: 

In [16]:
# Load the stop-word list once as a set; calling stopwords.words('english')
# for every token is very slow.
stop_words = set(stopwords.words('english'))
df['Plot'] = df['Plot'].apply(lambda words: [word for word in words if word not in stop_words])


In [17]:
df.head()

Unnamed: 0,Release Year,Title,Plot,Director,Cast,Genre,Plot Len
10281,1984,Silent Madness,"[computer, error, leads, accidental, release, ...",Simon Nuchtern,"Belinda Montgomery, Viveca Lindfors",horror,175
7341,1960,Desire in the Dust,"[lonnie, wilson, ken, scott, son, sharecropper...",Robert L. Lippert,"Raymond Burr, Martha Hyer, Joan Bennett",drama,1374
10587,1986,On the Edge,"[gaunt, bushy, bearded, 44, year, old, wes, ho...",Rob Nilsson,"Bruce Dern, Pam Grier",drama,839
25495,1988,Ram-Avtar,"[ram, avtar, childhood, best, friends, differe...",Sunil Hingorani,"Sunny Deol, Anil Kapoor, Sridevi",drama,857
16607,2013,Machete Kills,"[machete, cortez, danny, trejo, sartana, river...",Robert Rodriguez,"Danny Trejo, Michelle Rodriguez, Sofía Vergara...",action,4428
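
The cleaning steps up to this point (hyphen replacement, tokenizing, lowercasing, punctuation removal, stop-word removal) can also be folded into a single function, which makes it easier to apply the identical treatment to the holdout set later. This is a minimal sketch, not the notebook's own code: `clean_plot` is a hypothetical helper, and `STOP_WORDS` is a deliberately tiny illustrative set standing in for NLTK's English list.

```python
import string

# Small illustrative stop-word set (an assumption for this sketch); in the
# notebook this would be set(stopwords.words("english")) from NLTK.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "on", "in", "is", "are"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_plot(text, stop_words=STOP_WORDS):
    """Lowercase, split hyphenated words, strip punctuation, drop stop words."""
    text = text.replace("-", " ").lower()
    tokens = [word.translate(PUNCT_TABLE) for word in text.split()]
    # Discard empty strings left behind by punctuation-only tokens.
    return [tok for tok in tokens if tok and tok not in stop_words]


print(clean_plot("A gaunt, bushy-bearded, 44-year-old Wes Holman..."))
```

Applied to the DataFrame, this would be `df['Plot'] = df['Plot'].apply(clean_plot)`. Unlike the step-by-step cells, this version also drops tokens that become empty once punctuation is removed.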


#### Stemming:

In [18]:
p_stemmer = nltk.stem.PorterStemmer()
s_stemmer = nltk.stem.SnowballStemmer(language="english")

In [19]:
# Reduce every token to its Porter stem.
df['Plot'] = df['Plot'].apply(lambda words: [p_stemmer.stem(word) for word in words])


In [23]:
# Recompute the length as a token count now that each plot is a list.
df['Plot Len'] = df['Plot'].apply(len)

df.head()


Unnamed: 0,Release Year,Title,Plot,Director,Cast,Genre,Plot Len
10281,1984,Silent Madness,"[comput, error, lead, accident, releas, homici...",Simon Nuchtern,"Belinda Montgomery, Viveca Lindfors",horror,17
7341,1960,Desire in the Dust,"[lonni, wilson, ken, scott, son, sharecropp, z...",Robert L. Lippert,"Raymond Burr, Martha Hyer, Joan Bennett",drama,142
10587,1986,On the Edge,"[gaunt, bushi, beard, 44, year, old, we, holma...",Rob Nilsson,"Bruce Dern, Pam Grier",drama,85
25495,1988,Ram-Avtar,"[ram, avtar, childhood, best, friend, differ, ...",Sunil Hingorani,"Sunny Deol, Anil Kapoor, Sridevi",drama,81
16607,2013,Machete Kills,"[machet, cortez, danni, trejo, sartana, rivera...",Robert Rodriguez,"Danny Trejo, Michelle Rodriguez, Sofía Vergara...",action,437


#### Lemmatizing: 

In [24]:
lemmatizer = nltk.stem.WordNetLemmatizer()

### Task #5: Split the data into train & test sets:

Yes, we have a holdout set, but you do not know the genres of that data, so you can't use it to evaluate your models. You must therefore create your own training and test sets from the labeled data to evaluate your models.
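
Because the genres are far from balanced (3,770 dramas against 328 crime films in the counts above), a stratified split keeps the genre proportions similar in both parts. The following is a sketch on made-up data using scikit-learn's `train_test_split`; the 80/20 ratio and `random_state=42` are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, imbalanced stand-in: 15 dramas and 5 comedies.
toy = pd.DataFrame({
    "Plot": [f"plot text {i}" for i in range(20)],
    "Genre": ["drama"] * 15 + ["comedy"] * 5,
})

# stratify keeps the 3:1 genre ratio in both the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    toy["Plot"],
    toy["Genre"],
    test_size=0.2,
    stratify=toy["Genre"],
    random_state=42,
)

print(len(X_train), len(X_test))
print(y_test.value_counts().to_dict())
```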

### Task #6: Build a pipeline to vectorize the data, then train and fit your models.
You should train multiple types of models and try different combinations of the tuning parameters for each model to obtain the best one. You can use scikit-learn's `GridSearchCV` and `Pipeline` to help automate this process.
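
As one possible shape for this task, the sketch below chains a `TfidfVectorizer` and a `LogisticRegression` in a `Pipeline` and searches a small parameter grid with `GridSearchCV`. The corpus, the choice of model, the grid values, `cv=2`, and the `f1_macro` scoring are all assumptions made for illustration; `class_weight="balanced"` is one simple way to account for class imbalance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up miniature corpus with two genres.
texts = [
    "sheriff rides horse into dusty town",
    "outlaw gunman robs the stagecoach",
    "cowboy duel at the saloon at noon",
    "rancher defends cattle from rustlers",
    "ghost haunts the old mansion at night",
    "vampire stalks victims in the castle",
    "zombie outbreak traps teens in cabin",
    "demon possesses a child in the house",
]
labels = ["western"] * 4 + ["horror"] * 4

# Vectorizer and classifier chained so that both are refit inside each CV fold.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_estimator_.predict(["gunman rides into town"]))
```

`TfidfVectorizer` expects strings by default, so plots that were tokenized into lists in Task #4 would need to be joined back first, for example with `df['Plot'].apply(" ".join)`.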


### Task #7: Run predictions and analyze the results on the test set to identify the best model.  
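
The assessment is scored with F1, but the text does not say which multiclass averaging is used, so it may be worth looking at more than one. The sketch below computes macro and weighted F1 with `f1_score` and prints a per-genre breakdown with `classification_report`; the labels and predictions are made up.

```python
from sklearn.metrics import classification_report, f1_score

# Made-up true labels and predictions for six movies.
y_true = ["drama", "drama", "drama", "comedy", "comedy", "horror"]
y_pred = ["drama", "drama", "comedy", "comedy", "comedy", "drama"]

# Macro F1 averages the per-genre scores equally; weighted F1 weights each
# genre by its number of true examples. zero_division=0 scores a genre that
# was never predicted (horror here) as 0 without raising a warning.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
print(round(macro_f1, 3), round(weighted_f1, 3))

# Precision, recall and F1 for each genre.
print(classification_report(y_true, y_pred, zero_division=0))
```

Macro F1 penalizes a model that ignores the small genres, while weighted F1 is dominated by the large ones such as drama and comedy.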

### Task #8: Refit the model to all of your data and then use that model to predict the holdout set. 

### Task #9: Save your predictions as a csv file that you will send to the instructional staff for evaluation.
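
Tasks #8 and #9 can be sketched together: refit the chosen pipeline on every labeled row, predict the holdout plots, and write the result to CSV. Everything below is illustrative. The frames are made up, the model is a stand-in for whichever one performed best, and the required layout of the submission file is not stated in the source, so the single `Genre` column keyed by the holdout index is an assumption.

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up labeled data and unlabeled holdout data.
train = pd.DataFrame({
    "Plot": [
        "sheriff rides horse into dusty town",
        "outlaw gunman robs the stagecoach",
        "ghost haunts the old mansion at night",
        "vampire stalks victims in the castle",
    ],
    "Genre": ["western", "western", "horror", "horror"],
})
holdout = pd.DataFrame(
    {"Plot": ["gunman rides into town", "ghost in the castle"]},
    index=[101, 102],
)

# Stand-in for the best pipeline found earlier, refit on ALL labeled rows.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(train["Plot"], train["Genre"])

# Keep the holdout index so each prediction can be matched to its movie.
predictions = pd.DataFrame(
    {"Genre": model.predict(holdout["Plot"])},
    index=holdout.index,
)

# Written to an in-memory buffer here; passing a path such as
# "predictions.csv" (hypothetical name) to to_csv writes a file instead.
buffer = io.StringIO()
predictions.to_csv(buffer)
print(buffer.getvalue())
```

The holdout plots must go through exactly the same cleaning steps as the training plots before prediction.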

## Great job!