# Text Classification Assessment

This assessment is a text classification project: the goal is to predict the genre of a movie from its characteristics, primarily the text of its plot summary. You have a training set that you will use to build your best-performing model, and you will then use that model to predict the classes of the test set. We will compare the performance of your predictions with your classmates' using the F1 score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

The **movie_train.csv** dataset contains information (`Release Year`, `Title`, `Plot`, `Director`, `Cast`) about 10,682 movies and the label of `Genre`. There are 9 different genres in this data set, so this is a multiclass problem. You are expected to primarily use the plot column, but can use the additional columns as you see fit.

After you have identified your best-performing model, you will create predictions for the test set. The test set contains 3,561 movies with all of their information except the `Genre`.

Below is a list of tasks that you will definitely want to complete for this challenge, but the list is not exhaustive. It does not include any tasks for handling class imbalance or for testing multiple models and their tuning parameters, but you should still try those to see whether they help you build a better predictive model.


# Good Luck

### Task #1: Perform imports and load the dataset into a pandas DataFrame


In [1]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv('movie_train.csv', index_col=0)

In [4]:
df.head()

Unnamed: 0,Release Year,Title,Plot,Director,Cast,Genre
10281,1984,Silent Madness,A computer error leads to the accidental relea...,Simon Nuchtern,"Belinda Montgomery, Viveca Lindfors",horror
7341,1960,Desire in the Dust,"Lonnie Wilson (Ken Scott), the son of a sharec...",Robert L. Lippert,"Raymond Burr, Martha Hyer, Joan Bennett",drama
10587,1986,On the Edge,"A gaunt, bushy-bearded, 44-year-old Wes Holman...",Rob Nilsson,"Bruce Dern, Pam Grier",drama
25495,1988,Ram-Avtar,Ram and Avtar are both childhood best friends....,Sunil Hingorani,"Sunny Deol, Anil Kapoor, Sridevi",drama
16607,2013,Machete Kills,Machete Cortez (Danny Trejo) and Sartana River...,Robert Rodriguez,"Danny Trejo, Michelle Rodriguez, Sofía Vergara...",action


### Task #2: Check for missing values:

In [8]:
# Count missing (NaN) values in each column:
df.isnull().sum()

Release Year      0
Title             0
Plot              0
Director          0
Cast            169
Genre             0
dtype: int64

In [9]:
df.Genre.value_counts()

drama        3770
comedy       2724
horror        840
action        830
thriller      685
romance       649
western       525
adventure     331
crime         328
Name: Genre, dtype: int64
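
The counts above are heavily skewed toward drama and comedy, which is the class imbalance mentioned in the introduction. As a minimal sketch of one way to quantify it, the block below rebuilds a label vector from those counts and computes scikit-learn's "balanced" class weights. The names `genre_counts`, `labels`, and `class_weights` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Genre counts copied from the value_counts() output above.
genre_counts = {
    "drama": 3770, "comedy": 2724, "horror": 840, "action": 830,
    "thriller": 685, "romance": 649, "western": 525,
    "adventure": 331, "crime": 328,
}

# Rebuild a label vector with the same class frequencies (illustration only).
labels = np.repeat(list(genre_counts.keys()), list(genre_counts.values()))
classes = np.array(sorted(genre_counts))

# "balanced" weights are n_samples / (n_classes * count_per_class),
# so rare genres such as crime get larger weights than drama.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
class_weights = dict(zip(classes, weights))

for genre, weight in sorted(class_weights.items(), key=lambda kv: kv[1]):
    print(f"{genre:10s} {weight:.3f}")
```

Several scikit-learn classifiers (for example `LogisticRegression` and `LinearSVC`) accept `class_weight="balanced"` directly, which may be simpler than passing an explicit dictionary.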

In [None]:
# Preprocessing plan: lemmatize each token with spaCy, then keep only the
# tokens that are not stop words and not punctuation.

In [13]:
import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

In [21]:
punctuations = string.punctuation

nlp = spacy.load('en_core_web_sm')
# stop_words = spacy.lang.en.stop_words.STOP_WORDS
stop_words = STOP_WORDS
parser = English()

In [27]:
def spacy_tokenizer(text):
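    # Run the spaCy pipeline on the raw text.
    mytokens = nlp(text)
    # Lemmatize and lowercase each token. spaCy v2 lemmatizes pronouns to
    # "-PRON-", so keep the lowercased surface form for those tokens.
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    # Drop stop words and punctuation (this also drops empty strings).
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]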
    return mytokens

In [23]:
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
stopwords = set(STOPWORDS)

In [28]:
df['tokenized']=df['Plot'].apply(spacy_tokenizer)

KeyboardInterrupt: 

In [24]:
for genre in list(df['Genre'].unique()):
    show_wordcloud(df[df['Genre']==genre]['tokenized'], title = genre)

NameError: name 'show_wordcloud' is not defined
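
The NameError above arises because `show_wordcloud` is never defined in this notebook. A minimal sketch of such a helper follows, assuming it should join the token lists of one genre into a single string and render it with the `wordcloud` package. The helper `tokens_to_text` and all plotting settings are assumptions, not the author's implementation.

```python
def tokens_to_text(token_lists):
    """Join an iterable of token lists into one whitespace-separated string."""
    return " ".join(token for tokens in token_lists for token in tokens)


def show_wordcloud(token_lists, title=None):
    """Render a word cloud for the given token lists (hypothetical helper)."""
    # Imported inside the function so tokens_to_text can be used on its own
    # even where the plotting libraries are not installed.
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    cloud = WordCloud(background_color="white", max_words=200).generate(
        tokens_to_text(token_lists)
    )
    plt.figure(figsize=(10, 5))
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    if title:
        plt.title(title)
    plt.show()


# Quick check of the pure helper on toy tokens.
print(tokens_to_text([["ghost", "house"], ["killer"]]))
```

With this helper defined, the loop above should run once the `tokenized` column exists, which requires the interrupted tokenization cell to finish first.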

In [None]:
df['category_id'] = df['Genre'].factorize()[0]
category_id_df = df[['Genre','category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'Genre']].values)

In [None]:
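from sklearn.feature_selection import chi2
import numpy as np
# Requires train_features, y_train, and tfidf_train from the cells under Tasks #5 and #6.
N = 5
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names().
feature_names = np.array(tfidf_train.get_feature_names_out())
for Genre, category_id in sorted(category_to_id.items()):
    print(Genre, category_id)
    # Chi-squared score of every term against a one-vs-rest indicator for this genre.
    features_chi2 = chi2(train_features, y_train == Genre)
    # Sort terms by ascending score and print the N highest-scoring ones.
    indices = np.argsort(features_chi2[0])
    print("  top terms:", ", ".join(feature_names[indices][-N:]))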

### Task #3: Remove NaN values:
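
Only `Cast` has missing values (169 rows), and the plot text is complete. The sketch below shows two common ways to handle such gaps on a tiny stand-in frame. The frame `sample` and its contents are made up for illustration, and which option suits the real data is a judgment call.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the training frame; only Cast has a gap, as in the real data.
sample = pd.DataFrame({
    "Plot": ["first plot text", "second plot text"],
    "Cast": ["Some Actor", np.nan],
    "Genre": ["horror", "drama"],
})

# Option 1: keep every row and replace missing Cast values with an empty string.
filled = sample.fillna({"Cast": ""})

# Option 2: drop the rows that lack a Cast value.
dropped = sample.dropna(subset=["Cast"])

print(filled.isnull().sum().sum(), len(filled), len(dropped))
```

If the model uses only `Plot`, option 1 keeps all 10,682 training rows, whereas option 2 would discard rows whose plots are perfectly usable.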

### Task #4: Take a look at the columns and do some EDA to familiarize yourself with the data. This will consist of cleaning up the data set by doing things like removing stop words, tokenizing, and/or lemmatizing words. 

### Task #5: Split the data into train & test sets:

Yes, we have a holdout set, but you do not know the genres of that data, so you can't use it to evaluate your models. Therefore you must create your own training and test sets from the labeled data to evaluate your models. 

In [32]:
from sklearn.model_selection import train_test_split

In [33]:
X = df['Plot']
y = df['Genre']

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=13)

### Task #6: Build a pipeline to vectorize the data, then train and fit your models.
You should train multiple types of models and try different combinations of the tuning parameters for each model to obtain the best one. You can use scikit-learn's `GridSearchCV` and `Pipeline` to help automate this process.
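
As a minimal sketch of how `Pipeline` and `GridSearchCV` might be combined, the block below tunes a TF-IDF plus naive Bayes pipeline on a toy corpus. The corpus, the step names `tfidf` and `nb`, the parameter grid, and the choice of `f1_weighted` as the scoring metric are all assumptions for illustration; the real run would use `X_train` and `y_train` and a larger `cv`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus standing in for X_train / y_train.
toy_plots = [
    "a ghost haunts the old house", "the killer stalks the campers",
    "a monster rises from the lake", "the haunted doll attacks at night",
    "two friends fall in love", "a wedding brings the family together",
    "she meets her true love in paris", "a couple rekindles their romance",
]
toy_genres = ["horror"] * 4 + ["romance"] * 4

# Vectorizer and classifier are tuned together as one estimator.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("nb", MultinomialNB()),
])

# Grid keys use the "<step name>__<parameter>" convention.
params = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid=params, scoring="f1_weighted", cv=2)
search.fit(toy_plots, toy_genres)
print(search.best_params_, search.best_score_)
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so vocabulary and IDF statistics from the validation fold do not leak into training.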


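In [None]:
from sklearn.model_selection import GridSearchCV

# nb (the estimator), scoring, and params (the parameter grid) must be defined before this cell runs.
model = GridSearchCV(estimator = nb, scoring = scoring, param_grid = params,
                     cv = 5, n_jobs = -1, verbose = 2)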

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [31]:
tfidf_train = TfidfVectorizer(tokenizer = spacy_tokenizer, sublinear_tf=True, max_df=.8, min_df=10, ngram_range=(1,1))

In [36]:
train_features = tfidf_train.fit_transform(X_train).toarray()

KeyboardInterrupt: 

In [None]:
test_features = tfidf_train.transform(X_test).toarray()

In [None]:
# Run only after train_features has been computed above.
train_features.mean()

### Task #7: Run predictions and analyze the results on the test set to identify the best model.  
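
Because the predictions are compared by F1 score and the problem is multiclass, an averaging mode has to be chosen. The sketch below uses made-up labels to show macro and weighted averaging side by side. Which average the evaluation actually uses is not stated here, so both are shown as assumptions.

```python
from sklearn.metrics import classification_report, f1_score

# Hypothetical true and predicted genres for six held-out movies.
y_true = ["drama", "drama", "comedy", "horror", "comedy", "drama"]
y_pred = ["drama", "comedy", "comedy", "horror", "comedy", "drama"]

# Macro: unweighted mean of the per-class F1 scores (rare genres count equally).
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Weighted: per-class F1 scores averaged by the number of true samples per class.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print(f"macro F1: {macro_f1:.4f}, weighted F1: {weighted_f1:.4f}")

# Per-class precision, recall, and F1 in one table.
print(classification_report(y_true, y_pred))
```

With imbalanced genres, macro and weighted F1 can differ noticeably, so it helps to compare models on the same average throughout.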

### Task #8: Refit the model to all of your data and then use that model to predict the holdout set. 
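
A minimal sketch of this step on toy data follows, assuming the chosen model is a TF-IDF plus naive Bayes pipeline. The frames `labeled` and `holdout` are stand-ins for the full training file and the unlabeled test file, and the model choice is illustrative only.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Stand-in for all labeled data (train and test splits combined).
labeled = pd.DataFrame({
    "Plot": [
        "a ghost haunts the old house",
        "the killer stalks the campers",
        "two friends fall in love",
        "a wedding brings the family together",
    ],
    "Genre": ["horror", "horror", "romance", "romance"],
})

# Stand-in for the holdout set, which has no Genre column.
holdout = pd.DataFrame({"Plot": ["a haunted house by the lake", "love at a wedding"]})

# Refit the chosen pipeline on every labeled row, then predict the holdout plots.
final_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
final_model.fit(labeled["Plot"], labeled["Genre"])
holdout_pred = final_model.predict(holdout["Plot"])
print(holdout_pred)
```

Refitting on all labeled rows gives the final model more data than any single training split, at the cost of having no remaining labeled data to score it on.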

### Task #9: Save your predictions as a csv file that you will send to the instructional staff for evaluation. 
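
The required file layout is not specified here, so the sketch below simply assumes one `Genre` prediction per movie, indexed by the movie's id. The file name, the index name `movie_id`, and the example rows are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical predictions, indexed by the same ids as the holdout rows.
predictions = pd.DataFrame(
    {"Genre": ["drama", "comedy", "horror"]},
    index=pd.Index([101, 102, 103], name="movie_id"),
)

# Written to the temp directory here; use any path you like in practice.
out_path = os.path.join(tempfile.gettempdir(), "genre_predictions.csv")
predictions.to_csv(out_path)

# Read the file back to confirm it round-trips before sending it.
reloaded = pd.read_csv(out_path, index_col=0)
print(reloaded)
```

Keeping the holdout index in the file makes it possible to match each prediction to its movie even if the rows are reordered.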

## Great job!

F1: 0.662748 (accuracy: 0.66835)
20 models, bagged, voted