# No Fate But What We Make

There is an ongoing, pernicious problem in the world: people have to wait for a movie to come out to know how good it is. 

This is completely unacceptable. 

Sure, we can wait for reviews to come in, or, better yet, wait for aggregates of reviews to come in on sites such as Metacritic or Rotten Tomatoes, but wouldn't it be great to know that the upcoming superhero movie will disappoint you before you waste time getting excited for the trailers?

Movies are not black boxes (except for The Black Box (2005) and Black Box (2012)). They're produced by people who have produced other movies, written by people who have written other movies, performed by people who have performed in other movies, and directed by people who have directed other movies. Given that people who work in movies tend to be consistent in the quality of their output (see: Michael Bay), we should be able to predict the quality of a movie knowing only how the past movies of the people involved have performed.

It's easy enough to say that the new Jared Leto Joker movie will be awful. It will be. Every DC movie has been awful except for one, and Suicide Squad, the only other movie where Jared Leto played the Joker, was truly a crime against humanity. It's easy enough to say that, but can we teach a computer to say it? 

# Some Fellas Collect Stamps

In order to predict 


In [38]:
import imdb
import re
import pandas as pd
import numpy as np
import ast
from datetime import datetime, timedelta
from sklearn.feature_extraction.text import CountVectorizer

In [39]:
def release_to_datetime(n):
    # Convert 'DD Mon YYYY' strings to date objects; pass anything else (e.g. NaN) through unchanged
    if isinstance(n, str):
        n = n.replace(' Nov ', '-11-').replace(' Jun ', '-6-').replace(' Aug ', '-8-').replace(' Dec ', '-12-')
        n = n.replace(' Oct ', '-10-').replace(' Jan ', '-1-').replace(' Feb ', '-2-').replace(' Mar ', '-3-')
        n = n.replace(' Apr ', '-4-').replace(' May ', '-5-').replace(' Jul ', '-7-').replace(' Sep ', '-9-')
        return datetime.strptime(n, '%d-%m-%Y').date()
    return n

def delta_release(n):
    y2k = datetime.strptime('01-01-2000', '%d-%m-%Y').date()
    try:
        m = y2k - n
        return m.days
    except:
        return np.nan

def pull_month(n):
    try:
        return n.month
    except:
        return np.nan

def pull_day(n):
    try:
        return n.day
    except:
        return np.nan
    
def runtime_to_float(x):
    try:
        return float(x)
    except:
        return np.nan
    
def boxoffice_to_float(x):
    try:
        return float(x.replace(',',"").replace("$",""))
    except:
        return np.nan

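def RT_pull_out(entry):
    # Return the Rotten Tomatoes percentage as an int, or NaN if no such rating is listed.
    # The NaN return sits after the loop so that every rating source is checked, not just the first.
    for m in entry:
        if m['Source'] == 'Rotten Tomatoes':
            return int(m['Value'].replace('%', ''))
    return np.nan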
        
def evan_train_test_df_cvec_capstone(train, test, min_df):
    # Multi-hot encode every pipe-delimited text column, fitting each vectorizer on the train set only
    dummy_list_train = []
    dummy_list_test = []
    for x in train.columns:
        cvec = CountVectorizer(binary=True,
                               tokenizer=(lambda m: m.split('|') ),
                               min_df = min_df,
                               stop_words = 'english',
                               strip_accents='unicode')
        cvec.fit(train[x])
        lonely_matrix_train = cvec.transform(train[x])
        lonely_matrix_test = cvec.transform(test[x])
        df_train = pd.DataFrame(lonely_matrix_train.todense(), columns=cvec.get_feature_names())
        df_test = pd.DataFrame(lonely_matrix_test.todense(), columns=cvec.get_feature_names())
        dummy_list_train.append(df_train)
        dummy_list_test.append(df_test)
    dummied_df_train = pd.concat(dummy_list_train, axis=1)
    dummied_df_test = pd.concat(dummy_list_test, axis=1)
    return dummied_df_train, dummied_df_test

def movie_split_and_join(train, test, func, min_df=1):
    train_obj = train.select_dtypes(include=[np.object_])
    train_num = train.select_dtypes(include=[np.number, np.bool_])
    test_obj = test.select_dtypes(include=[np.object_])
    test_num = test.select_dtypes(include=[np.number, np.bool_])
    train_obj_dums, test_obj_dums = func(train_obj, test_obj, min_df)
    train_obj_dums.reset_index(drop=True, inplace=True)
    test_obj_dums.reset_index(drop=True, inplace=True)
    train_num.reset_index(drop=True, inplace=True)
    test_num.reset_index(drop=True, inplace=True)
    final_train = pd.concat([train_num, train_obj_dums], axis=1)
    final_test = pd.concat([test_num, test_obj_dums], axis=1)
    return final_train, final_test

def strip_list(column):
    for n in column:
        for m in range(len(n)):
            n[m] = n[m].strip()
    return column

def put_in_avgs(train, test, df):
    ind = 0
    for n in train.columns:
        for m in list(zip(df.name, df.avgscore)):
            if n == m[0]:
                train[n] *= m[1]
            ind += 1
            if ind % 10000 == 0:
                print(ind)
    ind = 0
    for n in test.columns:
        for m in list(zip(df.name, df.avgscore)):
            if n == m[0]:
                test[n] *= m[1]
            ind += 1
            if ind % 10000 == 0:
                print(ind)
    print(train.shape)
    print(test.shape)
    return train, test

def single_column_cvec(train, test, min_df):
    cvec = CountVectorizer(binary=True,
                                   tokenizer=(lambda m: m.split('|') ),
                                   min_df = min_df,
                                   stop_words = 'english',
                                   strip_accents='unicode')
    cvec.fit(train)
    lonely_matrix_train = cvec.transform(train)
    lonely_matrix_test = cvec.transform(test)
    new_train = pd.DataFrame(lonely_matrix_train.todense(), columns=cvec.get_feature_names())
    new_test = pd.DataFrame(lonely_matrix_test.todense(), columns=cvec.get_feature_names())
    return new_train, new_test
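
The two vectorizer helpers above amount to multi-hot encoding of pipe-delimited strings, keeping only the names that appear in at least `min_df` training rows. Here is a minimal pure-Python sketch of that idea, not the notebook's actual implementation. The function names `fit_vocabulary` and `multi_hot` and the toy rows are hypothetical, and the lowercasing mirrors what `CountVectorizer` does by default.

```python
from collections import Counter

def fit_vocabulary(rows, min_df=1):
    """Return the sorted tokens that occur in at least `min_df` rows."""
    doc_freq = Counter()
    for row in rows:
        # A set counts each token once per row (document frequency).
        doc_freq.update({tok.strip().lower() for tok in row.split('|') if tok.strip()})
    return sorted(tok for tok, n in doc_freq.items() if n >= min_df)

def multi_hot(rows, vocabulary):
    """Encode each row as a list of 0/1 flags, one per vocabulary token."""
    encoded = []
    for row in rows:
        present = {tok.strip().lower() for tok in row.split('|')}
        encoded.append([1 if tok in present else 0 for tok in vocabulary])
    return encoded

# Hypothetical toy data
train = ['Tom Hanks|Meg Ryan', 'Tom Hanks|Tim Allen', 'Meg Ryan']
test = ['Tim Allen|Meg Ryan']

vocab = fit_vocabulary(train, min_df=2)   # learned from the training rows only
train_matrix = multi_hot(train, vocab)
test_matrix = multi_hot(test, vocab)      # names unseen in training are ignored
```

Fitting the vocabulary on the training rows alone is what keeps the test columns aligned with the training columns.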

In [40]:
df = pd.read_csv("new_all_my_movies_final.csv", converters={"Actors": ast.literal_eval, 
                                                        "Director": ast.literal_eval, 
                                                        "Genre": ast.literal_eval, 
                                                        "RTRating": ast.literal_eval, 
                                                        "Writer": ast.literal_eval,
                                                        "Year": ast.literal_eval})

In [4]:
award_df = pd.read_csv('meta_award_add_final.csv') # never got around to using this
writers_df = pd.read_csv('writers_df.csv')
actors_df = pd.read_csv('actors_df.csv')
directors_df = pd.read_csv('directors_df.csv')
# actoravg= pd.read_csv('NewActorAvg.csv') # Plaguing problem 
# morta_df = pd.read_csv('morta.csv') # Same Plaguing problem

In [5]:
# Getting rid of pesky extra spaces
df.Actors = strip_list(df.Actors)
df.Director = strip_list(df.Director)
df.Writer = strip_list(df.Writer)
df.Genre = strip_list(df.Genre)

# Getting rid of our silly index column and dropping the duplicates
df.drop(['Unnamed: 0'], axis=1, inplace=True) 
df = df.drop_duplicates(subset=['imdbID'], keep='first')

# Joining actor list as pipes
df.Actors = df.Actors.map(lambda x: '|'.join(x))

# Joining directors as pipes
# Taking out any stuff in parentheses
df.Director = df.Director.map(lambda x: '|'.join(x))
df.Director = df.Director.map(lambda x: re.sub(r"[\(\[].*?[\)\]]", "", x))

# Joining genres as pipes
df.Genre = df.Genre.map(lambda x: '|'.join(x))

# Joining writers as pipes
# Taking out any stuff in parentheses
df.Writer = df.Writer.map(lambda x: '|'.join(x))
df.Writer = df.Writer.map(lambda x: re.sub(r"[\(\[].*?[\)\]]", "", x))

# Pulling out the Rotten Tomatoes rating from the RTRating column
df.RTRating = df.RTRating.map(RT_pull_out)

# Turning released to datetime object as well as creating a delta column
# Also creating a column for number of month and number of day
df.Released = df.Released.map(release_to_datetime)
df['days_from_y2k'] = df.Released.map(delta_release)
df['month'] = df.Released.map(pull_month)
df['day'] = df.Released.map(pull_day)

# Turning runtime and box office to floats
df.Runtime = df.Runtime.map(runtime_to_float)
df.BoxOffice = df.BoxOffice.map(boxoffice_to_float)

# Sorting the dataframe on released
df = df.sort_values(['Released'], ascending=True)
df.reset_index(drop=True, inplace=True)

# Intended to turn imdbID into a column of floats; the result is not assigned back,
# so df.imdbID itself is left unchanged
df.imdbID.map(lambda x: float(x.replace('tt', '.')))

# Adding a title length column 
df['title_length'] = df.Title.map(lambda x: len(x))
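
The release-date handling in this cell can also be sketched with the `%b` format code, which parses abbreviated month names directly. This is a minimal sketch, assuming the raw strings look like `25 Dec 2015` (inferred from the month replacements in `release_to_datetime`) and that the process runs under an English locale. The names `parse_release` and `days_before_y2k` are hypothetical.

```python
from datetime import date, datetime

Y2K = date(2000, 1, 1)

def parse_release(text):
    """Parse a '25 Dec 2015'-style string; return None if it does not match."""
    try:
        return datetime.strptime(text, '%d %b %Y').date()
    except (TypeError, ValueError):
        # TypeError: not a string (e.g. a missing value); ValueError: wrong format
        return None

def days_before_y2k(released):
    """Days from the release date to 1 Jan 2000 (negative for later releases)."""
    return None if released is None else (Y2K - released).days

released = parse_release('25 Dec 2015')
offset = days_before_y2k(released)
```

As in `delta_release`, the sign convention is "Y2K minus release date", so movies released after 2000 get negative values.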

In [6]:
df.columns

Index(['Actors', 'BoxOffice', 'Director', 'Genre', 'Metascore', 'Production',
       'RTRating', 'Rated', 'Released', 'Runtime', 'Title', 'Writer', 'Year',
       'imdbID', 'imdbRating', 'days_from_y2k', 'month', 'day',
       'title_length'],
      dtype='object')

In [7]:
# Metacritic score piece
meta = df.copy()

# Dropping rows that have no metascore or imdbID
meta = meta.dropna(axis=0, how='any', subset=['Metascore', 'imdbID']).copy()

# Dropping BoxOffice amount, Rotten Tomatoes rating, Release date, and IMDb rating.
meta.drop(['BoxOffice', 'RTRating', 'Released', 'imdbRating'], axis=1, inplace=True)

# Dropping all movies that have a null value for runtime or for "days from y2k."
meta.dropna(axis=0, how='any', subset=['Runtime', 'days_from_y2k'], inplace=True)

# Dropping "days from y2k"
meta.drop(['days_from_y2k'], axis=1, inplace=True)

# Filling production and rated column nulls with "unknown"
meta['Production'] = meta['Production'].fillna('Unknown')
meta['Rated'] = meta['Rated'].fillna('Unknown')

# Just making sure our null actor averages are np.nan to be deleted later
# Problem that plagued me forever
# meta['ActorAvg'].fillna(np.nan, inplace=True)

In [8]:
meta.Writer.isnull().sum()

0

In [9]:
# Setting a custom train and test set

X_train = meta.iloc[:9000,:]
X_test = meta.iloc[9000:,:]


# Here we set Director, Actor, and Writer columns that got average scores over their movies

# Directors = meta.iloc[:9000,:].drop(['Actors', 'Genre', 'Production', 'Rated',
#        'Runtime', 'Title', 'Writer', 'Year', 'imdbID', 'imdbRating', 'month', 'day',
#        'title_length'], axis=1)
# Actors = meta.iloc[:9000,:].drop(['Director', 'Genre', 'Production', 'Rated',
#        'Runtime', 'Title', 'Writer', 'Year', 'imdbID', 'imdbRating', 'month', 'day',
#        'title_length'], axis=1)
# Writers = meta.iloc[:9000,:].drop(['Actors', 'Genre', 'Production', 'Rated',
#        'Runtime', 'Title', 'Director', 'Year', 'imdbID', 'imdbRating', 'month', 'day',
#        'title_length'], axis=1)
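
Because the dataframe was sorted by release date earlier, slicing off the first 9,000 rows makes this a chronological split: the model trains on older movies and is tested on newer ones. Below is a minimal sketch of that idea on plain Python records. The helper `chronological_split` and the toy titles are hypothetical.

```python
from datetime import date

def chronological_split(records, n_train, key):
    """Sort records by `key` and return (oldest n_train records, the rest)."""
    ordered = sorted(records, key=key)
    return ordered[:n_train], ordered[n_train:]

# Hypothetical toy data, deliberately out of order
movies = [
    {'title': 'C', 'released': date(2016, 8, 5)},
    {'title': 'A', 'released': date(1999, 3, 31)},
    {'title': 'B', 'released': date(2008, 7, 18)},
]
train, test = chronological_split(movies, n_train=2, key=lambda m: m['released'])
```

Splitting by time rather than at random matches the stated goal of predicting upcoming movies from past ones.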

In [10]:
# Using our custom split and join function to create three sets of train data: directors, actors, and writers

# Directors, _ = movie_split_and_join(Directors, Directors, evan_train_test_df_cvec_capstone, 0)

# Actors, _ = movie_split_and_join(Actors, Actors, evan_train_test_df_cvec_capstone, 0)

# Writers, _ = movie_split_and_join(Writers, Writers, evan_train_test_df_cvec_capstone, 0)

# # Saving each director, actor, and writer in a data frame with their mean metascore. 

# directors_df = []
# for n in Directors.columns:
#     temp_tuple = (n, Directors[Directors[n]==1].Metascore.mean())
#     directors_df.append(temp_tuple)

# actors_df = []
# for n in Actors.columns:
#     temp_tuple = (n, Actors[Actors[n]==1].Metascore.mean())
#     actors_df.append(temp_tuple)

# writers_df = []
# for n in Writers.columns:
#     temp_tuple = (n, Writers[Writers[n]==1].Metascore.mean())
#     writers_df.append(temp_tuple)
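
The commented-out loops above compute, for each person, the mean Metascore of the training movies they appear in. The same calculation can be sketched without the intermediate dummy matrix. This assumes each row carries a pipe-joined string of names and a numeric score. The function `mean_score_by_name` and the toy rows are hypothetical.

```python
from collections import defaultdict

def mean_score_by_name(rows):
    """Map each name to the mean score of the rows it appears in.

    `rows` is an iterable of (pipe_joined_names, score) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for names, score in rows:
        # A set avoids double-counting a name repeated within one row.
        for name in set(names.lower().split('|')):
            totals[name] += score
            counts[name] += 1
    return {name: totals[name] / counts[name] for name in totals}

# Hypothetical toy data
rows = [('Director A|Director B', 80.0), ('Director A', 60.0)]
averages = mean_score_by_name(rows)
```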

In [12]:
# Saving each of our dataframes as csvs for future use.

# pd.DataFrame(writers_df, columns=['name', 'avgscore']).to_csv('writers_df.csv',index=False)
# pd.DataFrame(actors_df, columns=['name', 'avgscore']).to_csv('actors_df.csv',index=False)
# pd.DataFrame(directors_df, columns=['name', 'avgscore']).to_csv('directors_df.csv',index=False)

In [13]:
#Beginnings of problems that plagued me

# morta = df.dropna(axis=0, how='any', subset=['Metascore', 'imdbID']).copy()
# morta['ActorAvg'] = 0.

# Here, for all actors in a movie, we add up their metascores and average them.
# If the actors can't be found in actor df, that is, if they were only in movies without metascores
# or were missed for whatever other reason, we assign a NaN value to the cell, to flag the row for later removal. 

# morta_list = []
# for index, m in enumerate(morta.Actors):
#     s=0
#     den = 0
#     for p in m:
#         for n in zip(actors_df.name.values, actors_df.avgscore):
#             if p.lower() == n[0]:
#                 s = s + n[1]
#                 den = den + 1
#     if den == 0:
#         morta.ActorAvg[index]=666.666
#         morta_list.append(666.666)
#         print(index)
#     else:
#         morta.ActorAvg[index]=s/den
#         morta_list.append(s/den)
#         print(index)
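
One thing worth checking in the loop above: by this point `Actors` has already been joined into a single pipe-delimited string, so `for p in m` walks over individual characters rather than actor names. Here is a minimal sketch of the per-movie average that splits the string first and returns NaN, as the comment describes, when no actor is found. It is not a confirmed fix for the problem the author mentions. The function `actor_average` and the lookup table are hypothetical.

```python
import math

def actor_average(actors, avg_by_name):
    """Mean of the known actors' average scores, or NaN if none are known."""
    names = [a.strip().lower() for a in actors.split('|')]  # split before looping
    known = [avg_by_name[n] for n in names if n in avg_by_name]
    return sum(known) / len(known) if known else math.nan

# Hypothetical lookup table: actor name -> average score
avg_by_name = {'actor a': 70.0, 'actor b': 50.0}

movie_avg = actor_average('Actor A|Actor B|Unknown Actor', avg_by_name)
missing_avg = actor_average('Nobody Known', avg_by_name)
```

Returning NaN rather than a sentinel value such as 666.666 lets a later `dropna` remove the flagged rows.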

In [14]:
# morta.ActorAvg.sum()

In [15]:
# morta[['ActorAvg','imdbID']].sum()

In [16]:
# morta['ActorAvg'].sum()

In [17]:
# morta[['ActorAvg']].sum()

In [18]:
# morta['ActorAvg'].head()

In [19]:
# morta[['ActorAvg']].head()

In [20]:
# morta.ActorAvg.sum()

In [21]:
# morta_df = pd.DataFrame(morta_list)

In [22]:
# morta_df[0].sum()

In [23]:
# pd.DataFrame(morta_list).to_csv('morta.csv', index=False, header=True)

In [24]:
# avgact = list(zip(morta.ActorAvg, morta.imdbID))

In [25]:
# tempavg = []
# for m in avgact:
#     tempdic = {}
#     tempdic['ActorAvg'] = m[0]
#     tempdic['imdbID'] = m[1]
#     tempavg.append(tempdic)
# pd.DataFrame(tempavg).to_csv('NewActorAvg.csv', index=False, header=True)

In [26]:
# pd.DataFrame(tempavg).ActorAvg.sum()

In [27]:
# morta[['ActorAvg', 'imdbID']].to_csv('NewActorAvg.csv', index=False, header=True)

In [28]:
len(X_test)

1192

In [29]:
X_train.isnull().sum()

Actors          0
Director        0
Genre           0
Metascore       0
Production      0
Rated           0
Runtime         0
Title           0
Writer          0
Year            0
imdbID          0
month           0
day             0
title_length    0
dtype: int64

In [30]:
X_test.isnull().sum()

Actors          0
Director        0
Genre           0
Metascore       0
Production      0
Rated           0
Runtime         0
Title           0
Writer          0
Year            0
imdbID          0
month           0
day             0
title_length    0
dtype: int64

In [31]:
X_train_to_concat, X_test_to_concat = movie_split_and_join(X_train.drop(['Director', 'Actors', 'Writer'], axis=1), 
                                                           X_test.drop(['Director', 'Actors', 'Writer'], axis=1), 
                                                                       evan_train_test_df_cvec_capstone, 
                                                                       2)

director_train, director_test = single_column_cvec(X_train['Director'], X_test['Director'], 2)
actor_train, actor_test = single_column_cvec(X_train['Actors'], X_test['Actors'], 2)
writer_train, writer_test = single_column_cvec(X_train['Writer'], X_test['Writer'], 2)

In [32]:
director_train, director_test = put_in_avgs(director_train, director_test, directors_df)

10000
20000
30000
...
22790000
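
The counter output above reflects how `put_in_avgs` works: it compares every column name against every row of the lookup table, printing after each 10,000 comparisons. A dictionary lookup does the same weighting with one lookup per column. This is a minimal pure-Python sketch, assuming each name appears at most once in the lookup table. The function `weight_columns` and the toy data are hypothetical.

```python
def weight_columns(columns, avg_by_name):
    """Scale each 0/1 column by its name's average score, if one is known.

    `columns` maps a column name to its list of 0/1 indicator values.
    """
    weighted = {}
    for name, values in columns.items():
        factor = avg_by_name.get(name)          # one dictionary lookup per column
        if factor is None:
            weighted[name] = list(values)       # unknown name: leave the column as-is
        else:
            weighted[name] = [v * factor for v in values]
    return weighted

# Hypothetical toy data
columns = {'director a': [1, 0, 1], 'director b': [0, 1, 0]}
avg_by_name = {'director a': 70.0}
weighted = weight_columns(columns, avg_by_name)
```

With pandas, the equivalent would be to build the dictionary once from the `name` and `avgscore` columns and then multiply only the matching dataframe columns.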

In [33]:
X_train_direct = pd.concat([X_train_to_concat, director_train, writer_train, actor_train], axis=1)
X_test_direct = pd.concat([X_test_to_concat, director_test, writer_test, actor_test], axis=1)

In [36]:
# Saving our files to dataframes with directors weighted

# pd.DataFrame(X_train_direct).to_csv('train_everything_director_weights_df1.csv')
# pd.DataFrame(X_test_direct).to_csv('test_everything_director_weights_df1.csv')
pd.DataFrame(X_train_direct).to_csv('train_everything_director_weights_df2.csv')
pd.DataFrame(X_test_direct).to_csv('test_everything_director_weights_df2.csv')
# pd.DataFrame(X_train_direct).to_csv('train_everything_director_weights_df3.csv')
# pd.DataFrame(X_test_direct).to_csv('test_everything_director_weights_df3.csv')

In [None]:
# Pre-made dataframes with directors weighted

# X_train = pd.read_csv('train_everything_director_weights_df1.csv')
# X_test = pd.read_csv('test_everything_director_weights_df1.csv')
# X_train = pd.read_csv('train_everything_director_weights_df2.csv')
# X_test = pd.read_csv('test_everything_director_weights_df2.csv')
# X_train = pd.read_csv('train_everything_director_weights_df3.csv')
# X_test = pd.read_csv('test_everything_director_weights_df3.csv')

In [37]:
actor_train, actor_test = put_in_avgs(actor_train, actor_test, actors_df)

10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
140000
150000
160000
170000
180000
190000
200000
210000
220000
230000
240000


  self.values = np.delete(self.values, loc, 0)


KeyboardInterrupt: 

In [None]:
X_train_act = pd.concat([X_train_to_concat, director_train, writer_train, actor_train], axis=1)
X_test_act = pd.concat([X_test_to_concat, director_test, writer_test, actor_test], axis=1)

In [None]:
# Saving our files to dataframes with directors and actors weighted

pd.DataFrame(X_train_act).to_csv('train_everything_director_actor_weights_df1.csv')
pd.DataFrame(X_test_act).to_csv('test_everything_director_actor_weights_df1.csv')
# pd.DataFrame(X_train_act).to_csv('train_everything_director_actor_weights_df2.csv')
# pd.DataFrame(X_test_act).to_csv('test_everything_director_actor_weights_df2.csv')
# pd.DataFrame(X_train_act).to_csv('train_everything_director_actor_weights_df3.csv')
# pd.DataFrame(X_test_act).to_csv('test_everything_director_actor_weights_df3.csv')

In [None]:
# Pre-made dataframes with directors weighted, actors weighted

# X_train = pd.read_csv('train_everything_director_actor_weights_df1.csv')
# X_test = pd.read_csv('test_everything_director_actor_weights_df1.csv')
# X_train = pd.read_csv('train_everything_director_actor_weights_df2.csv')
# X_test = pd.read_csv('test_everything_director_actor_weights_df2.csv')
# X_train = pd.read_csv('train_everything_director_actor_weights_df3.csv')
# X_test = pd.read_csv('test_everything_director_actor_weights_df3.csv')

In [None]:
writer_train, writer_test = put_in_avgs(writer_train, writer_test, writers_df)

In [None]:
X_train_write = pd.concat([X_train_to_concat, director_train, writer_train, actor_train], axis=1)
X_test_write = pd.concat([X_test_to_concat, director_test, writer_test, actor_test], axis=1)

In [None]:
# Saving our files to dataframes with directors, actors, and writers weighted

pd.DataFrame(X_train_write).to_csv('train_everything_director_actor_writer_weights_df1.csv')
pd.DataFrame(X_test_write).to_csv('test_everything_director_actor_writer_weights_df1.csv')
# pd.DataFrame(X_train_write).to_csv('train_everything_director_actor_writer_weights_df2.csv')
# pd.DataFrame(X_test_write).to_csv('test_everything_director_actor_writer_weights_df2.csv')
# pd.DataFrame(X_train_write).to_csv('train_everything_director_actor_writer_weights_df3.csv')
# pd.DataFrame(X_test_write).to_csv('test_everything_director_actor_writer_weights_df3.csv')

In [None]:
# Pre-made dataframes with directors weighted, actors weighted, and writers weighted

# X_train = pd.read_csv('train_everything_director_actor_writer_weights_df1.csv')
# X_test = pd.read_csv('test_everything_director_actor_writer_weights_df1.csv')
# X_train = pd.read_csv('train_everything_director_actor_writer_weights_df2.csv')
# X_test = pd.read_csv('test_everything_director_actor_writer_weights_df2.csv')
# X_train = pd.read_csv('train_everything_director_actor_writer_weights_df3.csv')
# X_test = pd.read_csv('test_everything_director_actor_writer_weights_df3.csv')

In [None]:
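# Only applies when an ActorAvg column is present (its creation is commented out above)
if 'ActorAvg' in X_train.columns:
    X_train.dropna(axis=0, how='any', subset=['ActorAvg'], inplace=True) # including actor average
    X_test.dropna(axis=0, how='any', subset=['ActorAvg'], inplace=True) # including actor average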

# X_train.drop(['Actors'], axis=1, inplace=True)
# X_test.drop(['Actors'], axis=1, inplace=True)

# X_train.drop(["ActorAvg"], axis=1, inplace=True) 
# X_test.drop(['ActorAvg'], axis=1, inplace=True)

y_train = X_train.Metascore
y_test = X_test.Metascore

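# errors='ignore' because imdbRating was already dropped from meta earlier
X_train.drop(['Metascore','imdbRating'], axis=1, inplace=True, errors='ignore')
X_test.drop(['Metascore','imdbRating'], axis=1, inplace=True, errors='ignore')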

In [None]:
X_train.isnull().sum()

In [None]:
# If doing classification: label a movie 1 when its Metascore is above the training median, else 0.
# The training median is reused for the test set so the test labels use the same threshold.
median = np.median(y_train)

y_train = (np.asarray(y_train) > median).astype(int)
y_test = (np.asarray(y_test) > median).astype(int)

In [None]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
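
The scaling step fits the mean and standard deviation on the training set and reuses them on the test set, so no test-set statistics leak into the features. Here is a minimal single-column sketch of that arithmetic using the population standard deviation, which is what `StandardScaler` uses. The helpers `fit_scaler` and `transform` and the toy values are hypothetical.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Return (mean, population standard deviation) of a training column."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Standardize values; a zero-variance column is only centered."""
    scale = std if std > 0 else 1.0
    return [(x - mean) / scale for x in column]

# Hypothetical toy data
train_col = [1.0, 2.0, 3.0]
test_col = [4.0]

mean, std = fit_scaler(train_col)            # statistics come from train only
train_scaled = transform(train_col, mean, std)
test_scaled = transform(test_col, mean, std)
```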

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Lasso, SGDRegressor
from sklearn.feature_selection import RFE
# from sklearn.neighbors import KNeighborsRegressor
# from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor
# from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, f1_score
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.utils import np_utils
import matplotlib.pyplot as plt 

%matplotlib inline

In [None]:
br = BaggingRegressor()
br.fit(X_train, y_train)
# print('br train score: ')
# print(br.score(X_train, y_train))
print('br test score: ')
print(br.score(X_test, y_test))
print()

# Vanilla results
# br test score: 
# 0.20559549382827524

# With weighted directors
# br test score: 
# 0.19383386230059885

# With min_df = 3
# br test score: 
# 0.20237867138707444

# Without actor column 
# br test score: 
# 0.09933272358761469
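
For scikit-learn regressors, `.score` returns the coefficient of determination, R², so the figures recorded above are R² values on the test set rather than accuracies. Here is a minimal sketch of the calculation. The function name `r_squared` and the toy values are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot

# Hypothetical toy data
y_true = [50.0, 60.0, 70.0]
perfect = r_squared(y_true, y_true)
baseline = r_squared(y_true, [60.0, 60.0, 60.0])   # always predicting the mean
```

A score of 0 means the model does no better than always predicting the mean Metascore, which puts a test score of roughly 0.2 in context.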

In [None]:
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
# print('rf train score: ')
# print(rf.score(X_train, y_train))
print('rf test score: ')
print(rf.score(X_test, y_test))
print()

# Vanilla results
# rf test score: 
# 0.2060842519796644

# With weighted directors
# rf test score: 
# 0.2126052820813754

# With min_df = 3
# rf test score: 
# 0.2231686501613008

# Without actors column
# rf test score: 
# 0.09806195437754006

In [None]:
lasso = Lasso(.15)
lasso.fit(X_train, y_train)
# print('lasso train score: ')
# print(lasso.score(X_train, y_train))
print('lasso test score: ')
print(lasso.score(X_test, y_test))
print()

# Vanilla results
# alpha = .15
# lasso test score: 
# 0.2748017680997358

# With weighted directors
# lasso test score: 
# 0.28557531109204537

# With min_df = 3
# lasso test score: 
# 0.2831088809480061

# Without Actors column 
# lasso test score: 
# 0.22537316511240568

In [None]:
rf_params = {
    'max_depth': [None],
    'n_estimators': [500],
    'max_features': [10],
}

gs = GridSearchCV(rf, param_grid=rf_params)
gs.fit(X_train, y_train)
print(gs.score(X_test, y_test))
print(gs.best_score_)
print(gs.best_params_)

# 0.25813842124388553
# 0.2406969752051017
# {'max_depth': None, 'max_features': 10, 'n_estimators': 30}

# 0.2803089996507978
# 0.24567245984942382
# {'max_depth': None, 'max_features': 10, 'n_estimators': 100}

# 0.27221153535207276
# 0.2559582447598002
# {'max_depth': None, 'max_features': 10, 'n_estimators': 200}

# 0.2811413916107678
# 0.2586980681790728
# {'max_depth': None, 'max_features': 10, 'n_estimators': 500}

# Without Actors Column
# 0.29972139350423055
# 0.4539121713903267
# {'max_depth': None, 'max_features': 10, 'n_estimators': 500}

0.30695191022026025
0.4370792469556989
{'max_depth': None, 'max_features': 10, 'n_estimators': 500}


In [None]:
from keras import regularizers
from keras.layers import Dropout
from keras.callbacks import EarlyStopping


model = Sequential()
n_input = X_train.shape[1]
n_hidden = n_input
n_output = 1

early_stop = EarlyStopping(monitor='val_loss', min_delta=0, patience=0, verbose=1, mode='auto')
callbacks_list = [early_stop]

model.add(Dense(n_hidden, input_dim=n_input, activation='relu', kernel_regularizer=regularizers.l2(0.01)))
# regularizing weights coming into this first hidden layer
model.add(Dropout(0.5)) # refers to nodes in the first hidden layer

# input_dim is only needed on the first layer; later layers infer their input size
model.add(Dense(n_hidden, activation='relu', kernel_regularizer=regularizers.l2(0.01)))
model.add(Dropout(0.5)) # refers to nodes in the second hidden layer

# sigmoid output so that binary_crossentropy receives a probability between 0 and 1
model.add(Dense(n_output, activation='sigmoid', kernel_regularizer=regularizers.l2(0.01)))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])

# the labels may be plain lists after the classification cell, so convert them to arrays
history = model.fit(X_train, np.array(y_train), validation_data=(X_test, np.array(y_test)), epochs=10, batch_size=100, callbacks = callbacks_list)

In [None]:
# train_loss = history.history['loss']
# test_loss = history.history['val_loss']
# plt.plot(train_loss, label='Training loss')
# plt.plot(test_loss, label='Testing loss')

In [None]:
# history.history['acc'][-1], history.history['val_acc'][-1]

In [None]:
# from sklearn.metrics import roc_auc_score
# from sklearn.metrics import recall_score
# # roc_auc_score(y_true, y_score, average='macro', sample_weight=None)
# y_pred = model.predict_classes(X_test)

In [None]:
# roc_auc_score(y_test, y_pred)

In [None]:
# recall_score(y_test, y_pred)

In [None]:
# y_train

In [None]:
# np.median(y_train)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC

In [None]:
logreg = LogisticRegression() 
logreg.fit(X_train, y_train)
# print('logreg train score: ')
# print(logreg.score(X_train, y_train))
print('logreg test score: ')
print(logreg.score(X_test, y_test))
print()

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
# print('knn train score: ')
# print(knn.score(X_train, y_train))
print('knn test score: ')
print(knn.score(X_test, y_test))
print()

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
# print('dt train score: ')
# print(dt.score(X_train, y_train))
print('dt test score: ')
print(dt.score(X_test, y_test))
print()

br = BaggingClassifier()
br.fit(X_train, y_train)
# print('br train score: ')
# print(br.score(X_train, y_train))
print('br test score: ')
print(br.score(X_test, y_test))
print()

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
# print('rf train score: ')
# print(rf.score(X_train, y_train))
print('rf test score: ')
print(rf.score(X_test, y_test))
print()

ab = AdaBoostClassifier()
ab.fit(X_train, y_train)
# print('ab train score: ')
# print(ab.score(X_train, y_train))
print('ab test score: ')
print(ab.score(X_test, y_test))
print()

svc = SVC()
svc.fit(X_train, y_train)
# print('svc train score: ')
# print(svc.score(X_train, y_train))
print('svc test score: ')
print(svc.score(X_test, y_test))
print()

# director weights df 1
# logreg test score: 
# 0.529276693455798

# knn test score: 
# 0.5063145809414467

# dt test score: 
# 0.6016073478760046

# br test score: 
# 0.6463834672789897

# rf test score: 
# 0.6371986222732492

# ab test score: 
# 0.6326061997703789

# svc test score: 
# 0.5281285878300803
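The scores recorded above come from a single train/test split, so differences of a point or two between models may be noise. Averaging over k folds gives a steadier comparison. Below is a dependency-free sketch of how k-fold indices are formed; `k_fold_indices` is a hypothetical helper written for illustration, and it assumes the rows are already shuffled. In practice `sklearn.model_selection.cross_val_score` handles the splitting and scoring.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) lists for k contiguous, near-equal folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

Each sample lands in exactly one test fold, so every model is evaluated on all of the data once.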


In [None]:

"""# failed model graveyard

lr = LinearRegression()
lr.fit(X_train, y_train)
# print('lr train score: ')
# print(lr.score(X_train, y_train))
print('lr test score: ')
print(lr.score(X_test, y_test))
print()

# Vanilla results
# lr train score: 
# 1.0
# lr test score: 
# 0.24520781768736474

ab = AdaBoostRegressor()
ab.fit(X_train, y_train)
# print('ab train score: ')
# print(ab.score(X_train, y_train))
print('ab test score: ')
print(ab.score(X_test, y_test))
print()

# Vanilla results
# ab test score: 
# 0.14390680661700905

svr = SVR()
svr.fit(X_train, y_train)
# print('svr train score: ')
# print(svr.score(X_train, y_train))
print('svr test score: ')
print(svr.score(X_test, y_test))
print()

# Vanilla results
# svr test score: 
# -0.02440751761599569

dt = DecisionTreeRegressor()
dt.fit(X_train, y_train)
# print('dt train score: ')
# print(dt.score(X_train, y_train))
print('dt test score: ')
print(dt.score(X_test, y_test))
print()

# Vanilla results
# dt test score: 
# -0.026082404875374054

knn = KNeighborsRegressor()
knn.fit(X_train, y_train)
# print('knn train score: ')
# print(knn.score(X_train, y_train))
print('knn test score: ')
print(knn.score(X_test, y_test))
print()

# Vanilla results
# knn test score: 
# -0.3078089090936227

sgd = SGDRegressor(penalty='l1')
sgd.fit(X_train, y_train)
# print('sgd train score: ')
# print(sgd.score(X_train, y_train))
print('sgd test score: ')
print(sgd.score(X_test, y_test))
print()

# penalty = 'l1'
# max_iter = 500
# sgd test score: 
# -1.4108674430205027e+21

# With weighted directors
# sgd test score: 
# -1.4854946130885348e+21

# With min_df = 3
# sgd test score: 
# -8.040716721738737e+19

# Without Actor column
# sgd test score: 
# -1.4555174064291366e+21

"""

In [None]:
import wikipedia

movie_list = []
slug_list = ['numbers', 'A','B','C','D','E','F','G','H','I','J–K','L','M','N-O','P','Q–R','S','T','U-W','X–Z']
# collect every link from each of Wikipedia's alphabetical film index pages
for title in slug_list:
    page = wikipedia.page(f'List of films:_{title}')
    movie_list.extend(page.links)

In [None]:
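# build a new list instead of removing items mid-iteration, which skips elements
cleaned_list = []
for n in movie_list:
    if 'ist of fil' in n:
        continue  # drop links back to the index pages themselves
    if '(film)' in n:
        cleaned_list.append(n[0:-7])  # 'Title (film)' -> 'Title'
    elif ' film)' in n:
        # 'Title (1995 film)' -> ('Title', '1995')
        cleaned_list.append((n.split(' (')[0], n.split('(')[1][0:4]))
    else:
        cleaned_list.append(n)
movie_list = cleaned_list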

movie_list
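The link cleanup can also be written as a pure function, which is easier to test than slicing by position. This is a sketch under assumptions: `parse_link_title` is a hypothetical helper, the sample strings are made up, and unlike the slicing approach it only returns a year when the parenthetical really starts with four digits.

```python
import re

# "Alien (film)" has no year; "Heat (1995 film)" and "Solaris (2002 American film)" do.
_FILM_SUFFIX = re.compile(
    r"^(?P<title>.+?) \((?:(?P<year>\d{4}) )?(?:[^()]* )?film\)$"
)

def parse_link_title(link):
    """Return None for index pages, (title, year) when a year is present, else the title."""
    if "list of film" in link.lower():
        return None
    match = _FILM_SUFFIX.match(link)
    if match is None:
        return link  # no '(... film)' suffix, keep the link text as-is
    if match.group("year"):
        return (match.group("title"), match.group("year"))
    return match.group("title")

links = [
    "List of films: A",
    "Alien (film)",
    "Heat (1995 film)",
    "Solaris (2002 American film)",
    "Casablanca",
]
cleaned = [p for p in map(parse_link_title, links) if p is not None]
print(cleaned)
```

Filtering with a comprehension avoids mutating the list while looping over it.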

In [None]:
movie_list = list(set(movie_list))

In [None]:
len(movie_list)

In [None]:
test_list = []
a = ('test', '1676')
b = ('tust', '1778')
c = ('tist', '8989')
test_list.append(a)
test_list.append(b)
test_list.append(c)
test_list
type(a)

In [None]:
data = pd.read_csv('movies.csv')

In [None]:
data.title[2][0:-7]
data.title[2][-5:-1]

In [None]:
for n in data.title:
    movie_list.append((n[0:-7].split(' (')[0].split(', The')[0].replace('&','and'), n[-5:-1]))
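The fixed-width slicing above assumes every title ends in a four-digit year in parentheses, as in `'Name, The (1995)'`. A hedged alternative is a small function that checks for the year first. `split_title_year` is a hypothetical name, the sample titles are illustrative strings rather than rows from `movies.csv`, and the function otherwise mirrors the same cleanup steps.

```python
import re

_TRAILING_YEAR = re.compile(r"^(?P<name>.*?)\s*\((?P<year>\d{4})\)\s*$")

def split_title_year(raw):
    """Split 'Name, The (1995)' into ('Name', '1995'); year is None if absent."""
    match = _TRAILING_YEAR.match(raw)
    if match:
        name, year = match.group("name"), match.group("year")
    else:
        name, year = raw.strip(), None
    name = name.split(" (")[0]      # drop alternate titles in parentheses
    name = name.split(", The")[0]   # drop a trailing ', The', as the loop above does
    return name.replace("&", "and"), year

samples = [
    "Toy Story (1995)",
    "Usual Suspects, The (1995)",
    "Mr. & Mrs. Smith (2005)",
    "Untitled",
]
parsed = [split_title_year(s) for s in samples]
print(parsed)
```

A title without a year comes back with `None` instead of a truncated name and a garbage year.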

In [None]:
type(movie_list[9])

In [None]:
# df = pd.DataFrame(movie_list, columns=['title']) 

In [None]:
# pd.DataFrame(movie_list, columns=['title']).to_csv('movie_list.csv')

In [None]:
len(movie_list)

In [None]:
list_of_dicts = []
for title in (movie_list):
    temp_dict = {}
    if type(title) == str:
        temp_dict['title'] = title
        temp_dict['year'] = np.nan
    else:
        temp_dict['title'] = title[0]
        temp_dict['year'] = title[1]
    list_of_dicts.append(temp_dict)
list_of_dicts

In [None]:
pd.DataFrame(list_of_dicts).to_csv('new_movie_list.csv', index=False)

In [None]:
new_movie_list = pd.read_csv('new_movie_list.csv')

In [None]:
new_movie_list

In [None]:
for y in new_movie_list.head(100).year:
    print(type(y))

In [None]:
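import requests

title = 'The Dark Knight'
murl = 'http://www.omdbapi.com/'
# pass the query as params so requests URL-encodes spaces and '&' in titles
params = {'apikey': '7c448c5', 't': title, 'r': 'json'}

res = requests.get(murl, params=params)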

res.json()
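Because the movie list stores `(title, year)` pairs, the lookup URL can carry both. The sketch below builds the URL without making a network call. `build_omdb_url` and the `YOUR_KEY` placeholder are hypothetical, and the `y` (year of release) parameter is an assumption to check against the OMDb documentation; `apikey`, `t`, and `r` are the parameters used in the request above.

```python
from urllib.parse import urlencode

OMDB_ENDPOINT = "http://www.omdbapi.com/"

def build_omdb_url(api_key, title, year=None):
    """Build an OMDb title-lookup URL with properly escaped query parameters."""
    params = {"apikey": api_key, "t": title, "r": "json"}
    if year is not None:
        params["y"] = str(year)  # assumed: OMDb's year-of-release filter
    return f"{OMDB_ENDPOINT}?{urlencode(params)}"

url = build_omdb_url("YOUR_KEY", "The Dark Knight", 2008)
print(url)

# Characters such as '&' are escaped instead of splitting the query string.
tricky = build_omdb_url("YOUR_KEY", "Mr. & Mrs. Smith")
print(tricky)
```

Keeping the key in a variable (or an environment variable) rather than inline in the URL string also makes it easier to keep out of a shared notebook.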

In [None]:
movie_list = pd.read_csv('movie_list.csv', converters={"title": ast.literal_eval})

In [None]:
df.head(30)

In [None]:
len(df) - df.BoxOffice.isnull().sum()

In [None]:
ia = imdb.IMDb()

In [None]:
test = ia.get_movie('0408345')

In [None]:
test

# Capstone Project

Your Capstone project is the culmination of your time at GA. You will be tasked with developing an interesting question, collecting the data required to answer it, developing the strongest model (or models) for prediction, and communicating those findings to other data scientists and non-technical individuals. This introductory document lays out the five constituent portions of the project and their due dates.

## Your Deliverables

- A well-made predictive model using either structured or unstructured machine learning techniques (or another technique approved in advance by the global instructors), as well as clean, well-written code.
- A technical report aimed at fellow data scientists that explains your process and findings.
- A public presentation of your findings aimed at laypeople.

### **[Capstone, Part 1: Topic Proposals](./part_01/)**

In Part 1, get started by choosing **three potential topics and problems**, describing your goals & criteria for success, potential audience(s), and identifying 1-2 potential datasets. In the field of data science, good projects are practical. Your capstone project should be manageable and affect a real world audience. This might be a domain you are familiar with, a particular interest you have, something that affects a community you are involved in, or an area that relates to a field you wish to work in.

One of the best ways to test ideas quickly is to share them with others. A good data scientist has to be comfortable discussing ideas and presenting to audiences. That's why for Part 1 of your Capstone project, you'll be preparing a lightning talk in addition to your initial notebook outlining the scope of your project.  You will present your candidate topics in a slide deck, and should be prepared to answer questions and defend your data selection(s). Presentations should take no more than 3-5 minutes.

**The ultimate choice of topic for your capstone project is yours!** However, this is research and development work. Sometimes projects that look easy can be difficult and vice versa. It never hurts to have a second (or third) option available.

- **Goal**: Prepare a 3-5 minute lightning talk that covers three potential topics, including potential sources of data, goals, metrics and audience.
- **Due**: Thursday, June 7

### **[Capstone, Part 2: Problem Statement + EDA](./part_02/)**

For Part 2, provide a clear statement of the problem that you have chosen and an overview of your approach to solving that problem. Summarize your objectives, goals & success metrics, and any risks & assumptions. Outline your proposed methods and models, perform your initial EDA, and summarize the process. **Your data should be in hand by this point in the process!**

**Again, your data should be in hand by this point in the process!**

- **Goal**: Describe your proposed approach and summarize your initial EDA in a code submission to your local instructor ([submission link](https://docs.google.com/forms/d/e/1FAIpQLScez-8PsyIgP548fNtsoDpuNTdKxsr6tVvKPDtbr-mQov6NCw/viewform?usp=sf_link))
- **Due**: Wednesday, June 20

### **[Capstone, Part 3: Progress Report + Preliminary Findings](./part_03/)**

In Part 3, you'll create a progress report of your work in order to get feedback along the way. Describe your approach, initial results, and any setbacks or lessons learned so far. Your report should include updated visual and statistical analysis of your data. You’ll also meet with your local instructional team to get feedback on your results so far!

- **Goal**: Discuss progress and setbacks, include visual and statistical analysis, review with instructor. (A submission link for your progress report will be provided prior to the due date.)
- **Due**: Monday, July 2

### **[Capstone, Part 4: Report Writeup + Technical Analysis](./part_04/)**

By now, you're ready to apply your modeling skills to make machine learning predictions. Your goal for Part 4 is to develop a technical document (in the form of a Jupyter notebook) that can be shared among your peers.

Document your research and analysis, including a summary, an explanation of your modeling approach, and the strengths and weaknesses of any variables in the process. You should provide insight into your analysis, using best practices like cross-validation or applicable prediction metrics.

- **Goal**: Detailed report and code with a summary of your statistical analysis, model, and evaluation metrics.
- **Due**: Friday, July 13

### **[Capstone, Part 5: Presentation + Recommendations](./part_05/)**

Whether during an interview or as part of a job, you will frequently have to present your findings to business partners and other interested parties - many of whom won't know anything about data science! That's why for Part 5, you'll create a presentation of your previous findings with a non-technical audience in mind.

You should already have the analytical work complete, so now it's time to clean up and clarify your findings. Come up with a detailed slide deck or interactive demo that explains your data, visualizes your model, describes your approach, articulates strengths and weaknesses, and presents specific recommendations. Be prepared to explain and defend your model to an inquisitive audience!

- **Goal**: Detailed presentation deck that relates your data, model, and findings to a non-technical audience.
- **Due**: Tuesday, July 17
