<a href="https://colab.research.google.com/github/cdixson-ds/DS-Unit-2-Applied-Modeling/blob/master/LS_DS_231_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70%? If so, accuracy alone is a reasonable metric. Outside that range, accuracy can be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
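
The distribution checks in the list above can be sketched roughly as follows. This is a minimal illustration, not part of the assignment template: the helper `majority_class_share` and both toy Series are hypothetical and stand in for a real target column.

```python
import pandas as pd


def majority_class_share(target: pd.Series) -> float:
    """Return the relative frequency of the most common class."""
    return float(target.value_counts(normalize=True).iloc[0])


# Hypothetical classification target, used only for illustration.
y_class = pd.Series(
    ['movie', 'movie', 'movie', 'short', 'short',
     'tvSeries', 'movie', 'short', 'movie', 'movie']
)
share = majority_class_share(y_class)

# Rule of thumb from the checklist: accuracy alone is reasonable
# when the majority class share is >= 50% and < 70%.
accuracy_is_reasonable = 0.5 <= share < 0.7
print(f"majority share = {share:.2f}, accuracy alone OK: {accuracy_is_reasonable}")

# Hypothetical regression target with one large value.
# A positive skew suggests a right-skewed target.
y_reg = pd.Series([1, 2, 2, 3, 3, 3, 50])
skew = float(y_reg.skew())
print(f"skew = {skew:.2f}")
```

If the skew is clearly positive, a log transform of the target is one option to consider before fitting a regression model.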

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

# How can I evaluate the popularity of a genre? By rating, but also by number of movies made?



In [0]:
import pandas as pd

df_basic = pd.read_csv('https://datasets.imdbws.com/title.basics.tsv.gz', sep='\t', low_memory=False)

In [42]:
# Preview only: dropna() returns a new DataFrame and leaves df_basic unchanged
df_basic.dropna()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,\N,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
...,...,...,...,...,...,...,...,...,...
6503090,tt9916848,tvEpisode,Episode #3.17,Episode #3.17,0,2010,\N,\N,"Action,Drama,Family"
6503091,tt9916850,tvEpisode,Episode #3.19,Episode #3.19,0,2010,\N,\N,"Action,Drama,Family"
6503092,tt9916852,tvEpisode,Episode #3.20,Episode #3.20,0,2010,\N,\N,"Action,Drama,Family"
6503093,tt9916856,short,The Wind,The Wind,0,2015,\N,27,Short


In [24]:
df_basic.shape

(6503095, 9)

In [0]:
df_rating = pd.read_csv('https://datasets.imdbws.com/title.ratings.tsv.gz', sep='\t', low_memory=False)

In [44]:
df_basic.isnull().sum()

tconst             0
titleType          0
primaryTitle      10
originalTitle     10
isAdult            0
startYear          0
endYear            0
runtimeMinutes     0
genres            12
dtype: int64
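
The zero counts for `startYear`, `endYear`, and `runtimeMinutes` are worth a second look: the preview above shows the literal string `\N` in those columns, and pandas does not treat that string as missing by default. One possible fix, sketched here on a tiny in-memory sample (a trimmed stand-in for the real file, with rows taken from the preview), is to pass `na_values` to `read_csv`.

```python
import io

import pandas as pd

# Tiny stand-in for title.basics.tsv.gz; some columns are omitted for brevity.
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0000001\tshort\tCarmencita\t1894\t\\N\t1\tDocumentary,Short\n"
    "tt0000004\tshort\tUn bon bock\t1892\t\\N\t\\N\tAnimation,Short\n"
    "tt0000005\tshort\tBlacksmith Scene\t1893\t\\N\t1\tComedy,Short\n"
)

# Default parsing keeps "\N" as an ordinary string, so nothing counts as null.
raw = pd.read_csv(io.StringIO(sample_tsv), sep='\t')

# Declaring "\N" as a missing-value marker turns it into NaN.
parsed = pd.read_csv(io.StringIO(sample_tsv), sep='\t', na_values='\\N')

print(raw.isnull().sum())
print(parsed.isnull().sum())
```

The same `na_values='\\N'` argument could be added to the two `read_csv` calls in this notebook, although the null counts shown in later cells would then change.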

In [53]:
# Preview only: fillna() returns a new DataFrame, so df_basic still contains its NaN values
df_basic.fillna('missing')

Unnamed: 0,tconst,titleType,primaryTitle,startYear,genres
0,tt0000001,short,Carmencita,1894,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,1892,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,1892,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,1892,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,1893,"Comedy,Short"
...,...,...,...,...,...
6503090,tt9916848,tvEpisode,Episode #3.17,2010,"Action,Drama,Family"
6503091,tt9916850,tvEpisode,Episode #3.19,2010,"Action,Drama,Family"
6503092,tt9916852,tvEpisode,Episode #3.20,2010,"Action,Drama,Family"
6503093,tt9916856,short,The Wind,2015,Short


In [54]:
df_basic.isnull().sum()

tconst           0
titleType        0
primaryTitle    10
startYear        0
genres          12
dtype: int64
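
The null counts are unchanged because `fillna` returns a new DataFrame instead of modifying the original. A minimal sketch of the difference, using a hypothetical two-row frame named `toy`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
toy = pd.DataFrame({
    'primaryTitle': ['Carmencita', np.nan],
    'genres': [np.nan, 'Short'],
})

toy.fillna('missing')                      # result is discarded; toy is unchanged
nulls_before = int(toy.isnull().sum().sum())

toy = toy.fillna('missing')                # reassign to keep the filled values
nulls_after = int(toy.isnull().sum().sum())

print(nulls_before, nulls_after)
```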

I spot-checked several titles on the IMDb website to confirm that each rating in this dataset belongs to the title with the same `tconst` in the basics dataset.

In [27]:
df_rating.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.6,1578
1,tt0000002,6.1,189
2,tt0000003,6.5,1242
3,tt0000004,6.2,117
4,tt0000005,6.1,1985


In [28]:
df_rating.shape

(1019087, 3)
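
Since both tables share the `tconst` key, one way to combine them is a left merge. The sketch below uses small hypothetical frames (`basics`, `ratings`) built from rows shown above. The `validate` and `indicator` arguments are optional checks and are not something the notebook itself uses.

```python
import pandas as pd

# Small stand-ins for df_basic and df_rating, with values copied from the previews.
basics = pd.DataFrame({
    'tconst': ['tt0000001', 'tt0000002', 'tt0000009'],
    'primaryTitle': ['Carmencita', 'Le clown et ses chiens', 'Miss Jerry'],
    'genres': ['Documentary,Short', 'Animation,Short', 'Romance'],
})
ratings = pd.DataFrame({
    'tconst': ['tt0000001', 'tt0000002'],
    'averageRating': [5.6, 6.1],
    'numVotes': [1578, 189],
})

# how='left' keeps every title, rated or not.
# validate='one_to_one' raises if tconst is duplicated on either side.
# indicator=True adds a _merge column showing where each row was found.
merged = basics.merge(
    ratings, on='tconst', how='left', validate='one_to_one', indicator=True
)
print(merged[['tconst', 'averageRating', 'numVotes', '_merge']])
```

Given the two shapes above (about 6.5 million titles against about 1 million ratings), most titles would have no rating after a left merge, so an inner merge may be the better fit if only rated titles matter.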

In [0]:
# Drop columns that are not needed for the genre analysis
df_basic = df_basic.drop(columns=['originalTitle', 'isAdult', 'runtimeMinutes', 'endYear'])

In [58]:
pd.set_option('display.max_rows', None)  
df_basic.head(100)

Unnamed: 0,tconst,titleType,primaryTitle,startYear,genres
0,tt0000001,short,Carmencita,1894,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,1892,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,1892,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,1892,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,1893,"Comedy,Short"
5,tt0000006,short,Chinese Opium Den,1894,Short
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,1894,"Short,Sport"
7,tt0000008,short,Edison Kinetoscopic Record of a Sneeze,1894,"Documentary,Short"
8,tt0000009,movie,Miss Jerry,1894,Romance
9,tt0000010,short,Exiting the Factory,1895,"Documentary,Short"


In [31]:
df_basic.tail()

Unnamed: 0,tconst,titleType,primaryTitle,startYear,genres
6503090,tt9916848,tvEpisode,Episode #3.17,2010,"Action,Drama,Family"
6503091,tt9916850,tvEpisode,Episode #3.19,2010,"Action,Drama,Family"
6503092,tt9916852,tvEpisode,Episode #3.20,2010,"Action,Drama,Family"
6503093,tt9916856,short,The Wind,2015,Short
6503094,tt9916880,tvEpisode,Horrid Henry Knows It All,2014,"Animation,Comedy,Family"


In [49]:
df_basic['genres'].nunique()

2241

In [33]:
# Most titles carry several comma-separated genres, which produces 2,241 distinct
# combinations, so these strings need to be broken down into individual genres.

df_basic['genres'].unique()

array(['Documentary,Short', 'Animation,Short', 'Animation,Comedy,Romance',
       ..., 'Musical,Reality-TV,Talk-Show', 'Animation,Short,Talk-Show',
       'Comedy,Mystery,Talk-Show'], dtype=object)

In [34]:
df_basic['genres'].value_counts()

Drama                       641201
\N                          511417
Comedy                      474100
Documentary                 318057
Talk-Show                   318005
                             ...  
Romance,Sport,Western            1
Crime,Sci-Fi,War                 1
Adventure,News,Talk-Show         1
Biography,Crime,Sport            1
Horror,Romance,Western           1
Name: genres, Length: 2241, dtype: int64
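
Because each value is a comma-separated combination, these counts describe combinations and not individual genres. One possible way to count single genres, sketched on a hypothetical handful of values, is to split each string and then use `explode`:

```python
import pandas as pd

# Hypothetical sample of the genres column, including one missing value.
genres = pd.Series([
    'Documentary,Short',
    'Animation,Short',
    'Animation,Comedy,Romance',
    'Short',
    None,
])

# Split each string into a list, give every genre its own row, then count.
per_genre = genres.dropna().str.split(',').explode().value_counts()
print(per_genre)
```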

In [60]:
# Check the available title types; I plan to keep only rows tagged 'movie'

df_basic['titleType'].value_counts()

tvEpisode       4594463
short            721520
movie            542068
video            253057
tvSeries         177916
tvMovie          120767
tvMiniSeries      29204
tvSpecial         27308
videoGame         24933
tvShort           11859
Name: titleType, dtype: int64
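
Keeping only the movie rows could look roughly like this. The frame `titles` is a hypothetical three-row stand-in for `df_basic`.

```python
import pandas as pd

# Hypothetical stand-in for df_basic, using rows shown earlier in the notebook.
titles = pd.DataFrame({
    'tconst': ['tt0000008', 'tt0000009', 'tt9916848'],
    'titleType': ['short', 'movie', 'tvEpisode'],
    'primaryTitle': [
        'Edison Kinetoscopic Record of a Sneeze',
        'Miss Jerry',
        'Episode #3.17',
    ],
})

# Boolean mask on titleType; .copy() avoids chained-assignment warnings later.
movies = titles[titles['titleType'] == 'movie'].copy()
print(movies)
```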

The genres I want to track are:

Drama, Comedy, Documentary, Romance, Family, Animation, Crime, Action, Adventure, Mystery, Musical, Thriller, Horror, Sci-Fi, Fantasy, War, Western, Film-Noir, History, Sport, Biography

In [0]:
df_basic['genres'] = df_basic['genres'].str.lower()

In [0]:
# na=False treats rows with missing genres as non-matches, so every mask is strictly boolean
drama = df_basic['genres'].str.contains('drama', na=False)
comedy = df_basic['genres'].str.contains('comedy', na=False)
documentary = df_basic['genres'].str.contains('documentary', na=False)
romance = df_basic['genres'].str.contains('romance', na=False)
family = df_basic['genres'].str.contains('family', na=False)
animation = df_basic['genres'].str.contains('animation', na=False)
crime = df_basic['genres'].str.contains('crime', na=False)
action = df_basic['genres'].str.contains('action', na=False)
adventure = df_basic['genres'].str.contains('adventure', na=False)
mystery = df_basic['genres'].str.contains('mystery', na=False)
musical = df_basic['genres'].str.contains('musical', na=False)
thriller = df_basic['genres'].str.contains('thriller', na=False)
horror = df_basic['genres'].str.contains('horror', na=False)
sci_fi = df_basic['genres'].str.contains('sci-fi', na=False)
fantasy = df_basic['genres'].str.contains('fantasy', na=False)
war = df_basic['genres'].str.contains('war', na=False)
western = df_basic['genres'].str.contains('western', na=False)
film_noir = df_basic['genres'].str.contains('film-noir', na=False)
history = df_basic['genres'].str.contains('history', na=False)
sport = df_basic['genres'].str.contains('sport', na=False)
biography = df_basic['genres'].str.contains('biography', na=False)

In [38]:
df_basic.isna().sum()

tconst           0
titleType        0
primaryTitle    10
startYear        0
genres          12
dtype: int64

In [0]:
# Collapse every title whose genre string includes 'drama' into the single label 'drama'
df_basic.loc[drama, 'genres'] = 'drama'
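
Overwriting the `genres` string this way leaves each title with a single label, so a title tagged `comedy,drama` can no longer count toward both genres. If that information matters, one alternative (an assumption on my part, not the approach taken above) is to keep one indicator column per genre with `str.get_dummies`:

```python
import pandas as pd

# Hypothetical lowercase genre strings, including one missing value.
toy = pd.DataFrame({'genres': ['comedy,drama', 'drama', 'animation,short', None]})

# One 0/1 column per individual genre; missing values become all-zero rows.
flags = toy['genres'].fillna('').str.get_dummies(sep=',')
print(flags)

# Column sums give the number of titles per genre, multi-genre titles included.
print(flags.sum())
```

With this layout, a multi-genre title contributes to every genre it belongs to, which may suit a "number of movies made per genre" comparison better than a single label.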