# Day 4: Data Cleaning

## Tasks to do:

1. when loading, make sure the `movieId` column becomes the index
1. when loading, join the table from `movies.csv` with the table from `links.csv`
1. check the properties of the table with the `.info()` method
1. create a new column holding the year that appears in the movie title
1. create one column per genre, each holding only True/False values, e.g. `is_comedy`

## Loading and merging

In [186]:
from pathlib import Path

import pandas as pd

path = Path('data/movielens/')

movies = pd.read_csv(path / 'movies.csv', index_col='movieId')
links = pd.read_csv(path / 'links.csv', index_col='movieId')
movies = movies.merge(links, how='inner', left_index=True, right_index=True)
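
An inner join silently drops every `movieId` that is missing from either table, and repeated keys would multiply rows. The following is a minimal sketch, on toy data rather than the real files, of how the same merge could be guarded with `validate='one_to_one'` and how dropped ids could be listed. The names `movies_demo`, `links_demo`, `merged_demo` and `dropped_ids`, and all values in them, are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for movies.csv and links.csv (illustrative values only)
movies_demo = pd.DataFrame(
    {'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Hypothetical Film (2000)']},
    index=pd.Index([1, 2, 99999], name='movieId'),
)
links_demo = pd.DataFrame(
    {'imdbId': [114709, 113497], 'tmdbId': [862.0, 8844.0]},
    index=pd.Index([1, 2], name='movieId'),
)

# validate='one_to_one' raises pandas.errors.MergeError
# if a movieId is repeated on either side of the join
merged_demo = movies_demo.merge(
    links_demo,
    how='inner',
    left_index=True,
    right_index=True,
    validate='one_to_one',
)

# With how='inner', rows without a partner are dropped; list them explicitly
dropped_ids = movies_demo.index.difference(merged_demo.index)

print(merged_demo)
print(list(dropped_ids))
```

In the notebook's data the entry count after the merge can be compared with the row count of `movies.csv` to see whether anything was dropped.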

## Info and type conversion

In [187]:
movies['title'] = movies['title'].astype('string')
movies['genres'] = movies['genres'].astype('string')
movies['imdbId'] = movies['imdbId'].astype('int32')
movies['tmdbId'] = movies['tmdbId'].astype('Int32')
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9742 entries, 1 to 193609
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   9742 non-null   string
 1   genres  9742 non-null   string
 2   imdbId  9742 non-null   int32 
 3   tmdbId  9734 non-null   Int32 
dtypes: Int32(1), int32(1), string(2)
memory usage: 572.0 KB


In [188]:
# show the records with a null tmdbId
movies.loc[movies['tmdbId'].isnull()]

movieId,title,genres,imdbId,tmdbId
791,"Last Klezmer: Leopold Kozlowski, His Life and ...",Documentary,113610,
1107,Loser (1991),Comedy,102336,
2851,Saturn 3 (1980),Adventure|Sci-Fi|Thriller,81454,
4051,Horrors of Spider Island (Ein Toter Hing im Ne...,Horror|Sci-Fi,56600,
26587,"Decalogue, The (Dekalog) (1989)",Crime|Drama|Romance,92337,
32600,Eros (2004),Drama,377059,
40697,Babylon 5,Sci-Fi,105946,
79299,"No. 1 Ladies' Detective Agency, The (2008)",Comedy|Crime|Mystery,874957,
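
The eight missing values are why `tmdbId` is cast to the nullable `Int32` dtype (capital I), while `imdbId` can use plain `int32`. Here is a small sketch of the difference on an illustrative Series. The names `tmdb_demo`, `plain_cast_ok` and `nullable` are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Illustrative values; NaN stands for a missing tmdbId
tmdb_demo = pd.Series([862.0, np.nan, 8844.0], name='tmdbId')

try:
    # NumPy-backed integers have no representation for a missing value
    tmdb_demo.astype('int32')
    plain_cast_ok = True
except (ValueError, TypeError):
    plain_cast_ok = False

# The pandas nullable integer dtype keeps the gap as <NA>
nullable = tmdb_demo.astype('Int32')

print(plain_cast_ok)
print(nullable)
```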


## Extracting the year

In [210]:
# split each title into the name and the trailing year in parentheses
movies['title'].str.extract(r'^(?P<title>.*) \((?P<year>\d+)\)\s?$')
# movies.to_csv(path / 'filtered.csv')

movieId,title,year
1,Toy Story,1995
2,Jumanji,1995
3,Grumpier Old Men,1995
4,Waiting to Exhale,1995
5,Father of the Bride Part II,1995
...,...,...
193581,Black Butler: Book of the Atlantic,2017
193583,No Game No Life: Zero,2017
193585,Flint,2017
193587,Bungo Stray Dogs: Dead Apple,2018
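
The cell above only displays the extracted groups. One way the result could be stored as `title2` and `year` columns is sketched below on toy data. This is an assumption about how such columns might be produced, not the notebook's own cell; `demo`, `extracted` and `year_num` are hypothetical names.

```python
import pandas as pd

demo = pd.DataFrame(
    {'title': ['Toy Story (1995)', 'Jumanji (1995) ', 'Babylon 5']},
    index=pd.Index([1, 2, 40697], name='movieId'),
)
demo['title'] = demo['title'].astype('string')

# Named groups become the column names of the extracted frame
pattern = r'^(?P<title2>.*) \((?P<year>\d+)\)\s?$'
extracted = demo['title'].str.extract(pattern)

# join aligns on the movieId index; titles that did not match stay <NA>
demo = demo.join(extracted)

# The extracted year is text; a nullable integer column keeps the gaps as <NA>
demo['year_num'] = demo['year'].astype('Int16')

print(demo)
```

Titles without a year in parentheses, such as `Babylon 5`, end up with missing values in both new columns.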


In [211]:
# group `c` captures the whole title when there is no year in parentheses
pattern = r'^(?P<a>.*) \((?P<b>\d+)\)\s?$|^(?P<c>.*)$'
# pattern = r'^(?P<a>.*)(\s?\((?P<b>\d+)\)\s?)?$'
movies['title'].str.extract(pattern, expand=True)
# r.info()
# r
# r.loc[r['c'].isnull()]


movieId,a,b,c
1,Toy Story,1995,
2,Jumanji,1995,
3,Grumpier Old Men,1995,
4,Waiting to Exhale,1995,
5,Father of the Bride Part II,1995,
...,...,...,...
193581,Black Butler: Book of the Atlantic,2017,
193583,No Game No Life: Zero,2017,
193585,Flint,2017,
193587,Bungo Stray Dogs: Dead Apple,2018,
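
With the alternation pattern, every title matches one of the two branches: either `a` and `b` are filled, or only `c` is. A minimal sketch of how the three groups could be folded back into one title column and one year column follows, using toy data. The names `titles`, `parts` and `clean` are assumptions.

```python
import pandas as pd

titles = pd.Series(
    ['Toy Story (1995)', 'Babylon 5'],
    index=pd.Index([1, 40697], name='movieId'),
    dtype='string',
)

pattern = r'^(?P<a>.*) \((?P<b>\d+)\)\s?$|^(?P<c>.*)$'
parts = titles.str.extract(pattern, expand=True)

clean = pd.DataFrame({
    # use `a` where a year was found, otherwise fall back to the full title in `c`
    'title2': parts['a'].fillna(parts['c']),
    # the year stays missing for titles that took the fallback branch
    'year': parts['b'],
})

print(clean)
```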


In [191]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9742 entries, 1 to 193609
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   9742 non-null   string
 1   genres  9742 non-null   string
 2   imdbId  9742 non-null   int32 
 3   tmdbId  9734 non-null   Int32 
 4   title2  9729 non-null   string
 5   year    9729 non-null   string
dtypes: Int32(1), int32(1), string(4)
memory usage: 724.2 KB


In [192]:
problems = movies.loc[movies['title2'].isnull()]
# problems['title'].str.extract(r'^(?P<title3>.*)( \((?P<year>\d+)\)\s?)?$')
problems


movieId,title,genres,imdbId,tmdbId,title2,year
40697,Babylon 5,Sci-Fi,105946,,,
140956,Ready Player One,Action|Sci-Fi|Thriller,1677720,333339.0,,
143410,Hyena Road,(no genres listed),4034452,316042.0,,
147250,The Adventures of Sherlock Holmes and Doctor W...,(no genres listed),229922,127605.0,,
149334,Nocturnal Animals,Drama|Thriller,4550098,340666.0,,
156605,Paterson,(no genres listed),5247022,370755.0,,
162414,Moonlight,Drama,4975722,376867.0,,
167570,The OA,(no genres listed),4635282,432192.0,,
171495,Cosmos,(no genres listed),81846,409926.0,,
171631,Maria Bamford: Old Baby,(no genres listed),6264596,455601.0,,


## Movie genres

In [194]:
# get genres first: split each 'A|B|C' string and collect the distinct labels
genres = set(movies['genres'].str.split('|').explode())
# '(no genres listed)' is a placeholder, not a genre
genres.discard('(no genres listed)')
print(genres)

{'Comedy', 'Drama', 'IMAX', 'Fantasy', 'Musical', 'Sci-Fi', 'Children', 'Adventure', 'Romance', 'Documentary', 'Film-Noir', 'War', 'Action', 'Horror', 'Thriller', 'Western', 'Crime', 'Mystery', 'Animation'}


In [195]:
# create columns based on genres
for genre in genres:
    movies[f'is_{genre.lower()}'] = movies['genres'].str.contains(genre, regex=False)
movies.columns

Index(['title', 'genres', 'imdbId', 'tmdbId', 'title2', 'year', 'is_action',
       'is_adventure', 'is_animation', 'is_children', 'is_comedy',
       'is_documentary', 'is_drama', 'is_fantasy', 'is_horror', 'is_musical',
       'is_mystery', 'is_romance', 'is_scifi', 'is_thriller', 'is_imax',
       'is_sci-fi', 'is_film-noir', 'is_war', 'is_western', 'is_crime'],
      dtype='object')
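
The loop above matches substrings and produces names such as `is_sci-fi`, which cannot be used with attribute access (`movies.is_sci-fi`). As an alternative sketch, `Series.str.get_dummies` splits on the separator and matches whole labels. The helper `to_flag_name` and the names `genres_demo` and `flags` are hypothetical, and replacing the hyphen with an underscore is just one possible naming choice.

```python
import pandas as pd

genres_demo = pd.Series(
    ['Adventure|Sci-Fi|Thriller', 'Comedy', '(no genres listed)'],
    index=pd.Index([2851, 1107, 143410], name='movieId'),
    dtype='string',
)

# One indicator column per distinct label, converted to booleans
flags = genres_demo.str.get_dummies(sep='|').astype(bool)

# The placeholder label is not a genre, so drop its column if present
flags = flags.drop(columns='(no genres listed)', errors='ignore')


def to_flag_name(genre):
    """Hypothetical helper: 'Sci-Fi' -> 'is_sci_fi'."""
    return 'is_' + genre.lower().replace('-', '_')


flags = flags.rename(columns=to_flag_name)

print(flags.columns.tolist())
```

The resulting frame shares the `movieId` index, so it could be attached to the main table with `join`.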

In [198]:
# drop genres column
movies.drop('genres', axis='columns', inplace=True)

In [200]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9742 entries, 1 to 193609
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           9742 non-null   string 
 1   imdbId          9742 non-null   int32  
 2   tmdbId          9734 non-null   Int32  
 3   title2          9729 non-null   string 
 4   year            9729 non-null   string 
 5   is_action       9742 non-null   boolean
 6   is_adventure    9742 non-null   boolean
 7   is_animation    9742 non-null   boolean
 8   is_children     9742 non-null   boolean
 9   is_comedy       9742 non-null   boolean
 10  is_documentary  9742 non-null   boolean
 11  is_drama        9742 non-null   boolean
 12  is_fantasy      9742 non-null   boolean
 13  is_horror       9742 non-null   boolean
 14  is_musical      9742 non-null   boolean
 15  is_mystery      9742 non-null   boolean
 16  is_romance      9742 non-null   boolean
 17  is_scifi        9742 non-null  