# LABS-9: Analytics Project

In this notebook you will run and edit the code to perform some data cleaning and run a basic kNN model.

**Data**\
This dataset comes from IMDB and can be accessed on [Kaggle](https://www.kaggle.com/datasets/ashpalsingh1525/imdb-movies-dataset).

## Set up environment

In [310]:
## import packages
# ADD what each package is

import pandas as pd #data ingestion & cleaning
import numpy as np #numbers

# modeling 
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [311]:
# Read in data
data = pd.read_csv("imdb_movies.csv")

In [312]:
data

Unnamed: 0,names,date_x,score,genre,overview,crew,orig_title,status,orig_lang,budget_x,revenue,country
0,Creed III,03/02/2023,73.0,"Drama, Action","After dominating the boxing world, Adonis Cree...","Michael B. Jordan, Adonis Creed, Tessa Thompso...",Creed III,Released,English,75000000.0,2.716167e+08,AU
1,Avatar: The Way of Water,12/15/2022,78.0,"Science Fiction, Adventure, Action",Set more than a decade after the events of the...,"Sam Worthington, Jake Sully, Zoe Saldaña, Neyt...",Avatar: The Way of Water,Released,English,460000000.0,2.316795e+09,AU
2,The Super Mario Bros. Movie,04/05/2023,76.0,"Animation, Adventure, Family, Fantasy, Comedy","While working underground to fix a water main,...","Chris Pratt, Mario (voice), Anya Taylor-Joy, P...",The Super Mario Bros. Movie,Released,English,100000000.0,7.244590e+08,AU
3,Mummies,01/05/2023,70.0,"Animation, Comedy, Family, Adventure, Fantasy","Through a series of unfortunate events, three ...","Óscar Barberán, Thut (voice), Ana Esther Albor...",Momias,Released,"Spanish, Castilian",12300000.0,3.420000e+07,AU
4,Supercell,03/17/2023,61.0,Action,Good-hearted teenager William always lived in ...,"Skeet Ulrich, Roy Cameron, Anne Heche, Dr Quin...",Supercell,Released,English,77000000.0,3.409420e+08,US
...,...,...,...,...,...,...,...,...,...,...,...,...
10173,20th Century Women,12/28/2016,73.0,Drama,"In 1979 Santa Barbara, California, Dorothea Fi...","Annette Bening, Dorothea Fields, Lucas Jade Zu...",20th Century Women,Released,English,7000000.0,9.353729e+06,US
10174,Delta Force 2: The Colombian Connection,08/24/1990,54.0,Action,When DEA agents are taken captive by a ruthles...,"Chuck Norris, Col. Scott McCoy, Billy Drago, R...",Delta Force 2: The Colombian Connection,Released,English,9145817.8,6.698361e+06,US
10175,The Russia House,12/21/1990,61.0,"Drama, Thriller, Romance","Barley Scott Blair, a Lisbon-based editor of R...","Sean Connery, Bartholomew 'Barley' Scott Blair...",The Russia House,Released,English,21800000.0,2.299799e+07,US
10176,Darkman II: The Return of Durant,07/11/1995,55.0,"Action, Adventure, Science Fiction, Thriller, ...",Darkman and Durant return and they hate each o...,"Larry Drake, Robert G. Durant, Arnold Vosloo, ...",Darkman II: The Return of Durant,Released,English,116000000.0,4.756613e+08,US


## Data Cleaning & Model Prep

Before building a machine learning model, it is essential to clean and format the data. Raw data often contains missing values, inconsistent formats, or irrelevant information that can negatively impact or break a model. 
Many algorithms, including kNN, require numeric input or specificly formatted categorical data. By cleaning the data (removing or imputing missing values, converting strings to categorical variables, and creating dummy variables), we ensure that our dataset is structured in a way that the model can interpret and learn from effectively. 

Proper data preparation leads to more accurate, reliable, and interpretable results.

There are many decisions that get made throughout this process and there is often no "right" answer - so documentating why you do things as you clean data is **key**.

### Missing Values

We saw in our design lab that some of our columns are missing values. Many models can not tolerate missing data (they will break the model), so we have to deal with these before passing the data through to our model.

We can use the [`.info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info) method to see what columns are missing data. Run this below (look back at LABS-06 if you don't remember how).

In [313]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10178 entries, 0 to 10177
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   names       10178 non-null  object 
 1   date_x      10178 non-null  object 
 2   score       10178 non-null  float64
 3   genre       10093 non-null  object 
 4   overview    10178 non-null  object 
 5   crew        10122 non-null  object 
 6   orig_title  10178 non-null  object 
 7   status      10178 non-null  object 
 8   orig_lang   10178 non-null  object 
 9   budget_x    10178 non-null  float64
 10  revenue     10178 non-null  float64
 11  country     10178 non-null  object 
dtypes: float64(3), object(9)
memory usage: 954.3+ KB


2 columns are missing data: `genre` and `crew`. 

Since we have a large data set for kNN, we can drop the relatively few rows that are missing data using .dropna()

In [314]:
## make a new df to make changes to
model_data = data.copy()

In [315]:
model_data.dropna(inplace=True)

In [345]:
model_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10052 entries, 0 to 10177
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   score      10052 non-null  category
 1   status     10052 non-null  object  
 2   budget_x   10052 non-null  float64 
 3   revenue    10052 non-null  float64 
 4   country    10052 non-null  object  
 5   top_genre  10052 non-null  object  
 6   top_lang   10052 non-null  object  
 7   year       10052 non-null  int32   
dtypes: category(1), float64(2), int32(1), object(4)
memory usage: 598.9+ KB


### Look at some columns

*narrative about things we're looking for to clean up*

In [317]:
data['status'].value_counts()

status
Released           10131
Post Production       31
In Production         16
Name: count, dtype: int64

*narrative about levels of status*

In [318]:
data['country'].value_counts()

country
AU    4885
US    2750
JP     538
KR     361
FR     222
GB     174
ES     153
HK     125
IT     123
MX     105
CN      93
DE      88
CA      67
RU      52
PH      43
IN      43
AR      41
BR      38
TH      30
DK      24
PL      22
TR      20
NL      16
NO      16
CO      14
TW      13
ID      12
IE      11
SE       9
CL       9
BE       7
PE       7
GR       6
FI       6
SU       5
CH       5
UA       4
SG       4
ZA       3
HU       3
VN       3
IS       2
GT       2
PR       2
AT       2
UY       2
CZ       2
SK       2
IR       2
MY       2
DO       1
IL       1
BY       1
BO       1
MU       1
PY       1
LV       1
XC       1
PT       1
KH       1
Name: count, dtype: int64

#### collapse columns with too many levels

In [None]:
model_data['country'] = model_data['country'].apply(lambda x: x if model_data['country'].value_counts()[x] > 100 else 'other')

In [320]:
model_data['country'].value_counts()


country
AU       4880
US       2716
other     709
JP        502
KR        358
FR        219
GB        172
ES        151
IT        123
HK        120
MX        102
Name: count, dtype: int64

In [321]:
## look at other problematic columns

In [322]:
for col in ['genre', 'orig_lang']:
    print(f"Value counts for {col}:")
    print(model_data[col].value_counts())
    print("\n")

Value counts for genre:
genre
Drama                                          556
Comedy                                         373
Drama, Romance                                 268
Horror                                         258
Horror, Thriller                               202
                                              ... 
Adventure, Family, Mystery, Science Fiction      1
Mystery, Drama, Action, Crime                    1
Thriller, Comedy, Action                         1
Drama, Romance, Horror                           1
Fantasy, Drama, Comedy, Science Fiction          1
Name: count, Length: 2300, dtype: int64


Value counts for orig_lang:
orig_lang
English                                7381
Japanese                                675
Spanish, Castilian                      388
Korean                                  384
French                                  282
Chinese                                 144
Italian                                 142
Cantonese            

#### grab the top values for some 

*narrative about what we can use from this info*

In [324]:
def get_top_value(old_column_name, new_column_name):
    """
    Function to extract the first value from a column that contains multiple comma seperated values
    Appends a new column to the dataframe
    """

    col = list(model_data[old_column_name].values)

    top_list = []
    for item in col:
        item = str(item).split(",")
        item1 = item[0]
        top_list.append(item1)

    model_data[new_column_name] = top_list 

In [325]:
get_top_value('genre', 'top_genre')
get_top_value('orig_lang', 'top_lang')

In [327]:
# look at value counts again
for col in ['top_genre', 'top_lang']:
    print(f"Value counts for {col}:")
    print(model_data[col].value_counts())
    print("\n")

Value counts for top_genre:
top_genre
Drama              1865
Action             1563
Comedy             1377
Horror              931
Animation           885
Thriller            577
Adventure           571
Romance             413
Crime               371
Family              333
Science Fiction     313
Fantasy             263
Documentary         176
Mystery             108
War                  77
Music                76
Western              72
History              46
TV Movie             35
Name: count, dtype: int64


Value counts for top_lang:
top_lang
English           7381
Japanese           675
Spanish            388
Korean             384
French             282
Chinese            144
Italian            142
Cantonese          141
German              89
Russian             65
Tagalog             42
Portuguese          35
Thai                33
Norwegian           29
Hindi               26
Polish              26
Danish              23
Swedish             22
Turkish             21
Dutch

In [328]:
#collapse top_lang
model_data['top_lang'] = model_data['top_lang'].apply(lambda x: x if model_data['top_lang'].value_counts()[x] > 10 else 'other')

### Reformat columns

*narrative about what format columns need to be in for model*

#### Date

*narrative about how we will use the year only & reformat*

In [329]:
model_data['date_x'] = pd.to_datetime(model_data['date_x'])

In [330]:
model_data['year'] = model_data['date_x'].dt.year

#### Score

*Narrative about predicting score as high/low - avoid saying it is the target*

In [331]:
## reformat score
model_data['score'] = model_data['score'].apply(lambda x: 'high' if model_data['score'].value_counts()[x] > 70 else 'low')
model_data

Unnamed: 0,names,date_x,score,genre,overview,crew,orig_title,status,orig_lang,budget_x,revenue,country,top_genre,top_lang,year
0,Creed III,2023-03-02,high,"Drama, Action","After dominating the boxing world, Adonis Cree...","Michael B. Jordan, Adonis Creed, Tessa Thompso...",Creed III,Released,English,75000000.0,2.716167e+08,AU,Drama,English,2023
1,Avatar: The Way of Water,2022-12-15,high,"Science Fiction, Adventure, Action",Set more than a decade after the events of the...,"Sam Worthington, Jake Sully, Zoe Saldaña, Neyt...",Avatar: The Way of Water,Released,English,460000000.0,2.316795e+09,AU,Science Fiction,English,2022
2,The Super Mario Bros. Movie,2023-04-05,high,"Animation, Adventure, Family, Fantasy, Comedy","While working underground to fix a water main,...","Chris Pratt, Mario (voice), Anya Taylor-Joy, P...",The Super Mario Bros. Movie,Released,English,100000000.0,7.244590e+08,AU,Animation,English,2023
3,Mummies,2023-01-05,high,"Animation, Comedy, Family, Adventure, Fantasy","Through a series of unfortunate events, three ...","Óscar Barberán, Thut (voice), Ana Esther Albor...",Momias,Released,"Spanish, Castilian",12300000.0,3.420000e+07,AU,Animation,Spanish,2023
4,Supercell,2023-03-17,high,Action,Good-hearted teenager William always lived in ...,"Skeet Ulrich, Roy Cameron, Anne Heche, Dr Quin...",Supercell,Released,English,77000000.0,3.409420e+08,US,Action,English,2023
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10173,20th Century Women,2016-12-28,high,Drama,"In 1979 Santa Barbara, California, Dorothea Fi...","Annette Bening, Dorothea Fields, Lucas Jade Zu...",20th Century Women,Released,English,7000000.0,9.353729e+06,US,Drama,English,2016
10174,Delta Force 2: The Colombian Connection,1990-08-24,high,Action,When DEA agents are taken captive by a ruthles...,"Chuck Norris, Col. Scott McCoy, Billy Drago, R...",Delta Force 2: The Colombian Connection,Released,English,9145817.8,6.698361e+06,US,Action,English,1990
10175,The Russia House,1990-12-21,high,"Drama, Thriller, Romance","Barley Scott Blair, a Lisbon-based editor of R...","Sean Connery, Bartholomew 'Barley' Scott Blair...",The Russia House,Released,English,21800000.0,2.299799e+07,US,Drama,English,1990
10176,Darkman II: The Return of Durant,1995-07-11,high,"Action, Adventure, Science Fiction, Thriller, ...",Darkman and Durant return and they hate each o...,"Larry Drake, Robert G. Durant, Arnold Vosloo, ...",Darkman II: The Return of Durant,Released,English,116000000.0,4.756613e+08,US,Action,English,1995


In [332]:
model_data['score'] = model_data['score'].astype('category')

In [333]:
model_data.dtypes

names                 object
date_x        datetime64[ns]
score               category
genre                 object
overview              object
crew                  object
orig_title            object
status                object
orig_lang             object
budget_x             float64
revenue              float64
country               object
top_genre             object
top_lang              object
year                   int32
dtype: object

### drop the columns you won't use

*narrative about which columns we can use and which we can not*

In [335]:
model_data = model_data.drop(columns=['date_x', 'names', 'genre', 'overview', 'crew', 'orig_title', 'orig_lang'])
model_data.head()

Unnamed: 0,score,status,budget_x,revenue,country,top_genre,top_lang,year
0,high,Released,75000000.0,271616700.0,AU,Drama,English,2023
1,high,Released,460000000.0,2316795000.0,AU,Science Fiction,English,2022
2,high,Released,100000000.0,724459000.0,AU,Animation,English,2023
3,high,Released,12300000.0,34200000.0,AU,Animation,Spanish,2023
4,high,Released,77000000.0,340942000.0,US,Action,English,2023


### Train/test split

In [336]:
# features: all columns except 'score'
features = model_data.drop('score', axis=1)
# Target: score column
target = model_data['score']

In [337]:
# make columns 

#### Dummy variables

*narrative about why we need dummy vars - connect to distance + formats referenced above*

In [338]:
features = pd.get_dummies(features)

#### Split data

*narrative about why we need to split data, what train does & what test does*

In [339]:
# train test split
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=45)


## make model

*add narrative about knn object and drop down with attributes*

In [341]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(features_train, target_train)

0,1,2
,n_neighbors,5
,weights,'uniform'
,algorithm,'auto'
,leaf_size,30
,p,2
,metric,'minkowski'
,metric_params,
,n_jobs,


In [None]:
target_predicted = knn.predict(features_test)

In [344]:
print("Accuracy:", accuracy_score(target_test, target_predicted))

Accuracy: 0.9408254599701641
