# Actor or Movie? Classifying Wikipedia Articles

### Introduction
For this project, I used data from 20 different Wikipedia articles. The dataset includes the pages for 10 actors and 10 movies. Rather than analyze the text directly, I used part-of-speech (POS) tagging and named entity recognition (NER) tagging on word-tokenized data to obtain numerical measurements of the dataset. More specifically, I counted the amount of each tag for each actor or movie.

I used four different machine learning algorithms for classifying the articles. The algorithms I used are Naive Bayes, Decision Tree, and the ensemble method Gradient Boosting.

### About the Data Set

#### Actors
This is an informal project, so I simply chose actors I like. That said, I tried to select a broad range of actors, from B-Movies to blockbusters. My chosen actors are Tessa Thompson, Lupita Nyong'o, Winona Ryder, Amy Poehler, Bruce Campbell, Adam Scott, Nicolas Cage, Tim Curry, Margot Robbie, and Sandra Bullock.
#### Movies
For each actor above, I chose a movie they appeared in. The movies are "Sorry to Bother You", "Us", "Heathers", "Wet Hot American Summer", "Evil Dead II", "Piranha 3D", "Con Air", "The Rocky Horror Picture Show", "I, Tonya", and "Miss Congeniality."

### The Process

#### The General Idea
If the machine learning models perform well, this suggests that there are quantifiable differences between how actors and movies are written about on Wikipedia. Nota bene, this is a simple, straightforward analysis and it does not prove anything.

#### Pre-processing
1. Use the [Wikipedia library](https://pypi.org/project/wikipedia/) to gather the content of all Wikipedia pages I will use as my dataset.
2. Pre-process this data to remove citations, punctuation, numbers, other extraneous characters, and stopwords and to tokenize by word. 


#### Analysis
1. Tag the word-tokenized data by part-of-speech and by named entity.
2. Build a frequency distribution of the top three parts of speech and top three named entities
3. Use the counts from each frequency distribution to calculate (for each movie or each actor):
    * Mean
    * Maximum
    * Minimum
    * Range
    * Standard Deviation
4. Select K best features from Movies and from Actors
5. Train ML models
6. Calculate mean F1 score and mean accuracy score for each model

### Results
Naive Bayes and Gradient Boosting are tied for both the highest mean F1 score and the highest mean accuracy score (85.3% and 85%, respectively). I would thus consider these two to be the most reliable models. Decision Tree did not perform quite as well, with an F1 of 81.3% and an accuracy of 70%.

These scores indicate that there are quantifiable differences between Actor and Movie articles and that these differences can be shown effectively even with a small dataset and limited POS and NER tagging.

# Here we go!

## Imports

In [1]:
# Data manipulation and plotting imports
#import spacy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# NLTK imports
from nltk import tokenize, pos_tag, FreqDist
from nltk.corpus import stopwords
stopwords = set(stopwords.words('english'))
# import Stanford NER tagger
from nltk.tag import StanfordNERTagger

In [3]:
# Define sources for Stanford NER tagger
st = StanfordNERTagger(
    '/Users/emmahighland/Desktop/Projects/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz',
    '/Users/emmahighland/Desktop/Projects/stanford-ner-2018-10-16/stanford-ner.jar',
    encoding='utf-8')

In [4]:
# Wikipedia library -- wrapper for MediaWiki API
import wikipedia

In [5]:
# regular expressions import and patterns
# string import to match punctuation
import re
import string
punc = string.punctuation
cite = re.compile('(\[\d+\])')
year = re.compile('\(\d+\)')
headings = re.compile('(=)+')
numbers = re.compile('\d+')

In [6]:
# sklearn ML imports
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier



### Data Gathering and Pre-Processing

##### Pre-processing function

In [7]:
def pre_proc(article):
    # Tokenize as a sentence
    sent = tokenize.sent_tokenize(''.join(article))
    # Remove citations, headers (== or ===)
    sent_cleaned = [re.sub('(\s+)',' ',w) and re.sub(cite,'',w) and re.sub(headings,'',w) for w in sent]
    
    # Tokenize by word
    words = tokenize.word_tokenize(' '.join(sent_cleaned))
    '''Clean words by removing punctuation, pluralizaton, and numbers.
    word_tokenize will list the base word and the pluralization
    as separate words (i.e. "uncle's" becomes "uncle" and "'s'").
    '''
    words_cleaned = [w for w in words if w not in punc and 
                     re.sub(numbers,'',w) and 
                     re.sub(year,'',w) and 
                     w != "'s"]
    
    # Remove stopwords
    final_words = [w for w in words_cleaned if w.lower() not in stopwords]
    
    return final_words

##### Actor Wikipedia pages

In [8]:
tessa = wikipedia.page("Tessa Thompson").content

In [9]:
lupita = wikipedia.page("Lupita Nyong'o").content

In [10]:
amy = wikipedia.page("Amy Poehler").content

In [11]:
winona = wikipedia.page("Winona Ryder").content

In [12]:
bruce = wikipedia.page("Bruce Campbell").content

In [13]:
adam = wikipedia.page("Adam Scott").content

In [14]:
nic = wikipedia.page("Nicolas Cage").content

In [15]:
curry = wikipedia.page("Tim Curry").content

In [16]:
margot = wikipedia.page("Margot Robbie").content

In [17]:
sandra = wikipedia.page("Sandra Bullock").content

##### Pre-processing actor Wikipedia data

In [18]:
actor_list = [tessa,lupita,amy,winona,bruce,adam,nic,curry,margot,sandra]
actor_words = []
for a in actor_list:
    words = pre_proc(a)
    actor_words.append(words)

##### Movie Wikipedia pages

In [19]:
stby = wikipedia.page("Sorry to Bother You").content

In [20]:
us = wikipedia.page("Us (2019 film)").content

In [21]:
heathers = wikipedia.page("Heathers").content

In [22]:
whas = wikipedia.page("Wet Hot American Summer").content

In [23]:
ev2 = wikipedia.page("Evil Dead II").content

In [24]:
piranha = wikipedia.page("Piranha 3D").content

In [25]:
conair = wikipedia.page("Con Air").content

In [26]:
rhps = wikipedia.page("The Rocky Horror Picture Show").content

In [27]:
itonya = wikipedia.page("I, Tonya").content

In [28]:
miss = wikipedia.page("Miss Congeniality").content

##### Pre-processing movie Wikipedia data

In [29]:
movie_list = [stby,us,heathers,whas,ev2,piranha,conair,rhps,itonya,miss]
movie_words = []
for m in movie_list:
    words = pre_proc(m)
    movie_words.append(words)

### Tagging

##### Function to get frequency distribution for POS and NER tags

In [30]:
def get_freqdist(my_var,tags):
    # Universal tagset has simplified parts of speech tagging
    var_pos = pos_tag(my_var,tagset='universal')
    # Get the top three parts of speech
    if tags == 'pos':
        var_fd = FreqDist(tag for (word,tag) in var_pos if tag != '.').most_common(3)
    elif tags == 'ner':
        var_fd = FreqDist(tag for (word,tag) in st.tag(my_var) if tag != 'O').most_common(3)
    return var_fd

##### Create dictionaries to store counts of POS and NER tags

In [31]:
a_pos_dict = {}
a_ner_dict = {}
actor_names = ['Tessa Thompson',
               "Lupita Nyong'o",
               'Winona Ryder',
               'Amy Poehler',
               'Bruce Campbell',
               'Adam Scott',
               'Nicolas Cage',
               'Tim Curry',
              'Margot Robbie',
              'Sandra Bullock']

In [32]:
# Set up dictionaries and 
m_pos_dict = {}
m_ner_dict = {}
movie_names = ['Sorry to Bother You',
               'Us','Heathers',
               'Wet Hot American Summer',
               'Evil Dead II',
               'Piranha 3D','Con Air',
               'The Rocky Horror Picture Show',
              'I, Tonya',
              'Miss Congeniality']

##### Frequency distributions to get counts

In [33]:
for x in range(0,10):   
    pos = get_freqdist(actor_words[x],'pos')
    ner = get_freqdist(actor_words[x],'ner')
    n = actor_names[x]
    a_pos_dict[n] = pos
    a_ner_dict[n] = ner

In [34]:
for x in range(0,10):   
    pos = get_freqdist(movie_words[x],'pos')
    ner = get_freqdist(movie_words[x],'ner')
    n = movie_names[x]
    m_pos_dict[n] = pos
    m_ner_dict[n] = ner

### Tag count DataFrames

In [35]:
a_list = []

'''For actors, the top three parts of speech were always
    NOUN, VERB, ADJ in that order. The top three named entities
    were always PERSON, ORGANIZATION, LOCATION in that order.'''

for a in actor_names:
    nouncount = a_pos_dict[a][0][1]
    verbcount = a_pos_dict[a][1][1]
    adjcount = a_pos_dict[a][2][1]
    pcount = a_ner_dict[a][0][1]
    orgcount = a_ner_dict[a][1][1]
    lcount = a_ner_dict[a][2][1]
    a_list.append([nouncount,verbcount,adjcount,pcount,orgcount,lcount])
    
actors_df = pd.DataFrame(a_list,columns=['Nouns',
                                        'Verbs','Adjectives',
                                        'Person','Organization',
                                        'Location'],index=[actor_names])
actors_df

Unnamed: 0,Nouns,Verbs,Adjectives,Person,Organization,Location
Tessa Thompson,419,69,40,124,36,16
Lupita Nyong'o,1422,320,229,253,170,82
Winona Ryder,1619,389,242,411,159,43
Amy Poehler,1717,425,280,394,131,53
Bruce Campbell,927,217,109,182,89,18
Adam Scott,417,69,46,92,45,6
Nicolas Cage,1712,476,285,332,151,75
Tim Curry,1298,252,145,250,155,48
Margot Robbie,845,177,108,265,64,15
Sandra Bullock,1555,432,274,291,135,70


In [36]:
m_list = []

''' For movies, the top three parts of speech were always
    NOUN, VERB, ADJ in that order.
    Most movies had ORGANIZATION as the second most
    common named entity. However, some had
    LOCATION instead. I have accounted for both
    possibilities.'''

for m in movie_names:
    nouncount = m_pos_dict[m][0][1]
    verbcount = m_pos_dict[m][1][1]
    adjcount = m_pos_dict[m][2][1]
    pcount = m_ner_dict[m][0][1]
    if m_ner_dict[m][1][0] == 'ORGANIZATION':
        orgcount = m_ner_dict[m][1][1]
    else:
        orgcount = m_ner_dict[m][2][1]
    if m_ner_dict[m][2][0] == 'LOCATION':
        lcount = m_ner_dict[m][2][1]
    else:
        lcount = m_ner_dict[m][1][1]
    m_list.append([nouncount,verbcount,adjcount,pcount,orgcount,lcount])
    
movies_df = pd.DataFrame(m_list,columns=['Nouns',
                                        'Verbs','Adjectives',
                                        'Person','Organization',
                                        'Location'],index=movie_names)
movies_df

Unnamed: 0,Nouns,Verbs,Adjectives,Person,Organization,Location
Sorry to Bother You,637,213,144,109,22,17
Us,798,281,141,115,40,51
Heathers,1031,315,179,285,98,17
Wet Hot American Summer,572,132,91,91,41,30
Evil Dead II,1213,393,244,249,49,30
Piranha 3D,827,259,131,228,28,18
Con Air,933,272,148,154,56,58
The Rocky Horror Picture Show,2384,589,389,419,310,151
"I, Tonya",1071,348,179,330,45,18
Miss Congeniality,517,132,65,96,24,50


### Summary statistics
##### Summary Statistics function

In [37]:
def sum_stats(my_df,col_name):
    tmp_list = []
    tmp_var = my_df[col_name]
    for i in range(10):
        tmp_list.append(tmp_var[i])
    tmp_arr = np.array(tmp_list)
    return (tmp_arr.mean(),tmp_arr.min(),tmp_arr.max(),round(tmp_arr.std(),2))

##### Actors Summary Statistics

In [39]:
col_list = ['Nouns','Verbs','Adjectives','Person','Organization','Location']
a_sum_stats_dict = {}
for c in col_list:
    me,mi,ma,s = sum_stats(actors_df,c)
    a_sum_stats_dict[c] = (me,mi,ma,ma-mi,s)

a_sumstats_df = pd.DataFrame(a_sum_stats_dict, index=['Mean','Min','Max','Range','St.Dev'])
a_sumstats_df

Unnamed: 0,Nouns,Verbs,Adjectives,Person,Organization,Location
Mean,1193.1,282.6,175.8,259.4,113.5,42.6
Min,417.0,69.0,40.0,92.0,36.0,6.0
Max,1717.0,476.0,285.0,411.0,170.0,82.0
Range,1300.0,407.0,245.0,319.0,134.0,76.0
St.Dev,480.94,141.55,92.16,99.94,47.86,26.28


##### Movie Summary statistics

In [40]:
m_sum_stats_dict = {}
for c in col_list:
    me,mi,ma,s = sum_stats(movies_df,c)
    m_sum_stats_dict[c] = (me,mi,ma,ma-mi,s)

m_sumstats_df = pd.DataFrame(m_sum_stats_dict, index=['Mean','Min','Max','Range','St.Dev'])
m_sumstats_df

Unnamed: 0,Nouns,Verbs,Adjectives,Person,Organization,Location
Mean,998.3,293.4,171.1,207.6,71.3,44.0
Min,517.0,132.0,65.0,91.0,22.0,17.0
Max,2384.0,589.0,389.0,419.0,310.0,151.0
Range,1867.0,457.0,324.0,328.0,288.0,134.0
St.Dev,508.95,126.93,86.24,107.24,82.19,38.64


##### Summary statistics interpretation

Based on these summary statistics, it appears that the Movies data frame shows more variation overall. The range of the "nouns" tag count is higher here than for Actors. The amount of "organization" and "location" tags are much more variable for Movies than Actors, as shown by the broader range and higher standard deviation. In both cases, the ranges and standard deviations are approximately doubled for Movies relative to Actors.

There are also some similarities between the two categories. The "person" tag is most similar between Movies and Actors overall. The average amounts of nouns, verbs, adjectives, persons, organizations, and locations are comparable between Actors and Movies. In other words, there is a fairly consistent ratio. The amount of Nouns per individual actor or movie is the most variable for both categories.

### Feature Selection

In [41]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

My suspicion is that the most telling features are "nouns","person","organization",and "location". I am testing this using the k best algorithm for feature selection. Since this is a small data set, I am going to test the ML models with both the k best features and the full set.

In [42]:
actors_np = actors_df.to_numpy()
movies_np = movies_df.to_numpy()

In [43]:
'''Create Numpy array to be the key for ML training
The goal is to train the ML algorithms to match this key.
1 denotes actor, 0 denotes movie'''
goal = np.array([1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0])
'''Create a Numpy array that combines actors and movies.
The first 10 rows are actors (1) and the last 10 are
movies (0). This corresponds to the previous Numpy array.'''
m_a_np = np.concatenate((actors_np,movies_np))
m_a_np

array([[ 419,   69,   40,  124,   36,   16],
       [1422,  320,  229,  253,  170,   82],
       [1619,  389,  242,  411,  159,   43],
       [1717,  425,  280,  394,  131,   53],
       [ 927,  217,  109,  182,   89,   18],
       [ 417,   69,   46,   92,   45,    6],
       [1712,  476,  285,  332,  151,   75],
       [1298,  252,  145,  250,  155,   48],
       [ 845,  177,  108,  265,   64,   15],
       [1555,  432,  274,  291,  135,   70],
       [ 637,  213,  144,  109,   22,   17],
       [ 798,  281,  141,  115,   40,   51],
       [1031,  315,  179,  285,   98,   17],
       [ 572,  132,   91,   91,   41,   30],
       [1213,  393,  244,  249,   49,   30],
       [ 827,  259,  131,  228,   28,   18],
       [ 933,  272,  148,  154,   56,   58],
       [2384,  589,  389,  419,  310,  151],
       [1071,  348,  179,  330,   45,   18],
       [ 517,  132,   65,   96,   24,   50]])

In [44]:
m_a_new = SelectKBest(chi2, k=4).fit_transform(m_a_np, goal)
m_a_new

array([[ 419,   69,  124,   36],
       [1422,  320,  253,  170],
       [1619,  389,  411,  159],
       [1717,  425,  394,  131],
       [ 927,  217,  182,   89],
       [ 417,   69,   92,   45],
       [1712,  476,  332,  151],
       [1298,  252,  250,  155],
       [ 845,  177,  265,   64],
       [1555,  432,  291,  135],
       [ 637,  213,  109,   22],
       [ 798,  281,  115,   40],
       [1031,  315,  285,   98],
       [ 572,  132,   91,   41],
       [1213,  393,  249,   49],
       [ 827,  259,  228,   28],
       [ 933,  272,  154,   56],
       [2384,  589,  419,  310],
       [1071,  348,  330,   45],
       [ 517,  132,   96,   24]])

I anticipated that "nouns","person","organization", and "location" would be the most telling. The best 4 features are actually "nouns", "verbs", "person", and "organization".

### Machine Learning Classifiers

##### ML Classifiers

In [45]:
# Naive Bayes
clf_nb = MultinomialNB()
# Decision Tree
clf_dt = tree.DecisionTreeClassifier()

# Ensemble methods
# Random forest
clf_rf = RandomForestClassifier(n_estimators=100, 
                                max_features="sqrt",
                                max_depth=None,
                                min_samples_split=2, 
                                random_state=0)
# Gradient boost
clf_gb = GradientBoostingClassifier(n_estimators=100, 
                                    learning_rate=1.0,
                                    max_depth=None, 
                                    random_state=0)

## ML Model Training and Evaluation using K best features

##### Naive Bayes, mean F1 score and mean accuracy score

In [46]:
k_nb_scores_f1 = cross_val_score(clf_nb, m_a_new, goal, cv=5,scoring='f1')
k_nb_scores_acc = cross_val_score(clf_nb,m_a_new,goal,cv=5,scoring='accuracy')
k_nb_scores_f1.mean(),k_nb_scores_acc.mean()

(0.8533333333333333, 0.85)

##### Decision Tree, mean F1 and mean accuracy

In [47]:
k_dt_scores_f1 = cross_val_score(clf_dt, m_a_new, goal, cv=5,scoring='f1')
k_dt_scores_acc = cross_val_score(clf_dt,m_a_new,goal,cv=5,scoring='accuracy')
k_dt_scores_f1.mean(),k_dt_scores_acc.mean()

(0.8133333333333335, 0.7)

##### Gradient Boosting, mean F1 and mean accuracy

In [48]:
k_gb_scores_f1 = cross_val_score(clf_gb, m_a_new, goal, cv=5,scoring='f1')
k_gb_scores_acc = cross_val_score(clf_gb,m_a_new,goal,cv=5,scoring='accuracy')
k_gb_scores_f1.mean(),k_gb_scores_acc.mean()

(0.8533333333333333, 0.85)