# Preprocessing and cleaning
The goal of this notebook is to preprocess the `v2_movies_cleaned.csv` file, obtained after feature augmentation and refinement by GPT, so that it is easy to work with.

- [Data exploration and cleaning](#data-exploration-and-cleaning)

In [1]:
import pandas as pd
import numpy as np
import re

from src.utils.helpers import convert_csv
from src.constants import *

## Data exploration and cleaning

In [2]:
# Load the data
movies = pd.read_csv(DATA_FOLDER_PREPROCESSED + "v2_movies_cleaned.csv")
movies.head()

Unnamed: 0,wikipedia_id,freebase_id,title,languages,countries,genres,keywords,release_date,runtime,plot_summary,cold_war_side,character_western_bloc_representation,character_eastern_bloc_representation,western_bloc_values,eastern_bloc_values,theme
0,4213160.0,/m/0bq8q8,$,,['Soviet Union'],"['Comedy', 'Crime', 'Drama']",,1971,119.0,"Set in Hamburg, West Germany, several criminal...","""Western""","['Joe Collins', 'American bank security consul...","['Dawn Divine', 'hooker with a heart of gold',...",['None'],"['Resourcefulness', 'cleverness', 'individuali...",['None']
1,,,"$1,000 on the Black","['Italiano', 'Deutsch']","['Italy', 'Germany']",['Western'],,1966,104.0,Johnny Liston has just been released from pris...,"""Eastern""",['None'],"['Sartana', 'villainous', 'oppressive', 'cruel...","['Johnny Liston', 'justice', 'determination', ...","['Justice', 'revenge', 'oppressed vs. oppresso...","['Terror', 'betrayal', 'familial conflict', 'c..."
2,,,"$10,000 Blood Money",,['Soviet Union'],"['Western', 'Drama']",,1967,,Hired by a Mexican landowner to rescue his dau...,"""None""",['None'],['None'],['None'],['None'],"['crime', 'betrayal', 'revenge', 'bounty hunte..."
3,,,"$100,000 for Ringo",['Italiano'],['Italy'],"['Western', 'Drama']","['spaghetti western', 'whipping']",1965,98.0,A stranger rides into Rainbow Valley where he'...,"""None""",['None'],['None'],['None'],['None'],"['Western', 'Civil War', 'mistaken identity', ..."
4,,,'Anna' i wampir,,['Soviet Union'],['Crime'],,1982,,"Silesia in Poland, late 60s. Bodies of vicious...","""None""",['None'],['None'],['None'],['None'],"['murder mystery', 'horror', 'fog', 'Poland', ..."


In [3]:
for col in movies.columns:
    print(col, type(movies[col][0]))

wikipedia_id <class 'numpy.float64'>
freebase_id <class 'str'>
title <class 'str'>
languages <class 'float'>
countries <class 'str'>
genres <class 'str'>
keywords <class 'float'>
release_date <class 'numpy.int64'>
runtime <class 'numpy.float64'>
plot_summary <class 'str'>
cold_war_side <class 'str'>
character_western_bloc_representation <class 'str'>
character_eastern_bloc_representation <class 'str'>
western_bloc_values <class 'str'>
eastern_bloc_values <class 'str'>
theme <class 'str'>


We observe that list-like columns such as `countries` and `genres` were read back as strings, while the numeric columns (`wikipedia_id`, `release_date`, `runtime`) kept numeric types and missing values show up as `float` (NaN). That's because CSV files are plain text and cannot store complex data types such as lists or dictionaries directly. Let's convert them into more convenient types.
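
`convert_csv` lives in `src.utils.helpers` and its body is not shown in this notebook. As a rough sketch of the general idea only, list-like strings can be parsed back into Python lists with `ast.literal_eval`. The helper name `parse_list_cell` and the toy frame below are hypothetical and are not the project's implementation.

```python
import ast

import pandas as pd


def parse_list_cell(value):
    """Turn a string such as "['Italy', 'Germany']" into a list; leave other values alone."""
    if isinstance(value, str) and value.startswith("[") and value.endswith("]"):
        try:
            return ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return value  # Not a valid Python literal: keep the raw string
    return value  # NaN, numbers and plain strings pass through unchanged


# Toy data that mimics what read_csv returns for list-like columns
toy = pd.DataFrame({
    "countries": ["['Italy', 'Germany']", "['Soviet Union']"],
    "languages": [float("nan"), "['Italiano']"],
})

# Apply the parser to every cell, column by column
parsed = toy.apply(lambda column: column.map(parse_list_cell))
print(type(parsed.loc[0, "countries"]))
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text that comes from a file.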

In [4]:
convert_csv(movies).head()

Unnamed: 0,wikipedia_id,freebase_id,title,languages,countries,genres,keywords,release_date,runtime,plot_summary,cold_war_side,character_western_bloc_representation,character_eastern_bloc_representation,western_bloc_values,eastern_bloc_values,theme
0,4213160.0,/m/0bq8q8,$,,[Soviet Union],"[Comedy, Crime, Drama]",,1971,119.0,"Set in Hamburg, West Germany, several criminal...",Western,"[Joe Collins, American bank security consultan...","[Dawn Divine, hooker with a heart of gold, cun...",[None],"[Resourcefulness, cleverness, individualism, h...",[None]
1,,,"$1,000 on the Black","[Italiano, Deutsch]","[Italy, Germany]",[Western],,1966,104.0,Johnny Liston has just been released from pris...,Eastern,[None],"[Sartana, villainous, oppressive, cruel, arche...","[Johnny Liston, justice, determination, resili...","[Justice, revenge, oppressed vs. oppressor, re...","[Terror, betrayal, familial conflict, crime, r..."
2,,,"$10,000 Blood Money",,[Soviet Union],"[Western, Drama]",,1967,,Hired by a Mexican landowner to rescue his dau...,,[None],[None],[None],[None],"[crime, betrayal, revenge, bounty hunter, heis..."
3,,,"$100,000 for Ringo",[Italiano],[Italy],"[Western, Drama]","[spaghetti western, whipping]",1965,98.0,A stranger rides into Rainbow Valley where he'...,,[None],[None],[None],[None],"[Western, Civil War, mistaken identity, treasu..."
4,,,'Anna' i wampir,,[Soviet Union],[Crime],,1982,,"Silesia in Poland, late 60s. Bodies of vicious...",,[None],[None],[None],[None],"[murder mystery, horror, fog, Poland, 1960s]"


Let's homogenize the `countries` column:

In [5]:
countries_representation = {
    'Soviet Union': 'Russia',
    'Soviet occupation zone': 'Russia',
    'Ukrainian SSR': 'Ukraine',
    'Ukranian SSR': 'Ukraine',
    'Uzbek SSR': 'Uzbekistan',
    'Georgian SSR': 'Georgia',
    'West Germany': 'Germany',
    'German Democratic Republic': 'Germany',
    'East Germany': 'Germany',
    'United Kingdom': 'United Kingdom',
    'England': 'United Kingdom',
    'Wales': 'United Kingdom',
    'Scotland': 'United Kingdom',
    'Northern Ireland': 'United Kingdom',
    'Socialist Federal Republic of Yugoslavia': 'Yugoslavia',
    'Federal Republic of Yugoslavia': 'Yugoslavia',
    'Republic of China': 'Taiwan',
    'South Korea': 'Korea',
    'North Korea': 'Korea',
    'Kingdom of Italy': 'Italy',
    'Republic of Macedonia': 'Macedonia',
    'Libyan Arab Jamahiriya': 'Libya',
    'Cote DIvoire': 'Côte d\'Ivoire',
    'Kingdom of Great Britain': 'United Kingdom',
    'Malayalam Language': 'India',
    'Syrian Arab Republic': 'Syria',
    'Kyrgyz Republic': 'Kyrgyzstan',
    'Slovak Republic': 'Czechoslovakia'
}

def preprocess_countries(row):
    """Map historical or regional country names to a single name and drop duplicates."""
    countries = row['countries']
    if isinstance(countries, list):
        # A set removes the duplicates created by the mapping (e.g. East and West Germany);
        # it is converted back to a list afterwards
        row['countries'] = list({countries_representation.get(string, string) for string in countries})
    return row

In [6]:
movies = movies.apply(preprocess_countries, axis=1)
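
To see what the mapping does to a single entry, here is a small self-contained check. The trimmed table `sample_mapping` and the helper `normalise_countries` are illustrative names, not part of the notebook. This sketch uses `dict.fromkeys` instead of `set` so that the original order is kept, which is a deliberate departure from the cell above.

```python
# A trimmed, hypothetical version of the mapping used in the notebook
sample_mapping = {
    "West Germany": "Germany",
    "East Germany": "Germany",
    "Soviet Union": "Russia",
}


def normalise_countries(countries, mapping):
    """Rename countries through the mapping and drop duplicates, keeping order."""
    if not isinstance(countries, list):
        return countries  # Missing values (NaN) are left untouched
    renamed = [mapping.get(name, name) for name in countries]
    return list(dict.fromkeys(renamed))  # dict keys are unique and keep insertion order


print(normalise_countries(["West Germany", "East Germany", "Italy"], sample_mapping))
# ['Germany', 'Italy']
```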

In [7]:
movies['languages'].explode().unique()

array([nan, 'Italiano', 'Deutsch', 'English', 'Français', 'Český',
       'Português', 'हिन्दी', '广州话 / 廣州話', 'Pусский', 'No', 'Español',
       '日本語', 'Magyar', 'Dansk', 'Polski', '한국어/조선말', 'Română',
       'Tiếng Việt', '普通话', 'Latin', 'Nederlands', '', 'svenska',
       'Latviešu', 'ελληνικά', 'বাংলা', 'Український', 'עִבְרִית',
       'shqip', 'suomi', 'فارسی', 'Azərbaycan', 'தமிழ்', 'Russian',
       'العربية', 'French', 'اردو', 'తెలుగు', 'ქართული', 'Eesti',
       'Türkçe', 'Íslenska', 'Hrvatski', 'Srpski', 'Cymraeg',
       'Slovenčina', 'Wolof', 'Italian', 'Afrikaans', 'Català', 'Swedish',
       'Spanish', 'беларуская мова', 'Slovenščina', 'Norsk', 'Fulfulde',
       'Bosanski', 'български език', 'German', 'euskera', 'Polish',
       'Czech', 'ภาษาไทย', 'Japanese', 'Kiswahili', 'Gaeilge', 'Finnish',
       'Norwegian', 'Greek', 'Lietuvi\x9akai', 'Esperanto', 'қазақ',
       'Bahasa melayu', 'Bahasa indonesia', 'Malti', 'Chinese', 'Danish',
       'isiZulu', 'Persian', 'Bamana

In [8]:
movies['countries'].explode().unique()

array(['Russia', 'Germany', 'Italy', 'United States of America',
       'Estonia', 'Ukraine', 'Switzerland', 'Lithuania', 'France',
       'Egypt', 'United Kingdom', 'India', 'Hong Kong', 'Costa Rica',
       'Spain', 'Canada', 'Latvia', 'Hungary', 'Japan', 'Poland',
       'Denmark', 'Korea', 'Croatia', 'Austria', 'Philippines',
       'Portugal', 'Taiwan', 'Georgia', 'Romania', 'Australia',
       'South Africa', 'Luxembourg', 'Sweden', 'Ireland', 'Greece',
       'Uruguay', 'Argentina', 'Belgium', 'Netherlands', 'Czech Republic',
       'Bangladesh', 'New Zealand', 'Albania', 'Finland', 'Iceland',
       'Liechtenstein', 'Iran', 'Zimbabwe', 'Norway',
       'Bosnia and Herzegovina', 'Cuba', 'Peru', 'Israel', 'Uzbekistan',
       'China', 'Mexico', 'Azerbaijan', 'Bolivia', 'Brazil', 'Venezuela',
       'Serbia', 'Macedonia', 'Monaco', 'Slovakia', 'Turkey', 'Senegal',
       'Qatar', 'Tunisia', "Cote D'Ivoire", 'Belarus', 'Armenia',
       'Algeria', 'Thailand', 'Colombia', 'Chile', '

In [9]:
movies['cold_war_side'].explode().unique()

array(['Western', 'Eastern', 'None'], dtype=object)

The values in the `countries` and `cold_war_side` columns seem fine, but we observe some inconsistencies in `languages`.

In [10]:
languages_translation = {
    '广州话/廣州話': 'Chinese',
    '广州话 / 廣州話': 'Chinese',
    '日本語': 'Japanese',
    'Japan': 'Japanese',
    '普通话': 'Chinese',
    '한국어/조선말': 'Korean',
    'ภาษาไทย': 'Thai',
    'हिन्दी': 'Indian',
    'தமிழ்': 'Indian',
    'TiếngViệt': 'Vietnamese',
    'Tiếng Việt': 'Vietnamese',
    'العربية': 'Arabic',
    'اردو': 'Indian',
    'българскиезик': 'Bulgarian',
    'Pусский': 'Russian',
    'беларускаямова': 'Belarusian',
    'Український': 'Ukrainian',
    'Srpski': 'Serbian',
    'Slovenčina': 'Slovak',
    'Français': 'French',
    'France': 'French',
    'Deutsch': 'German',
    'Italiano': 'Italian',
    'Español': 'Spanish',
    'Polski': 'Polish',
    'Standard Mandarin': 'Chinese',
    'Mandarin Chinese': 'Chinese',
    'Mandarin': 'Chinese',
    'Português': 'Portuguese',
    'Standard Cantonese': 'Chinese',
    'Cantonese': 'Chinese',
    'suomi': 'Finnish',
    'Magyar': 'Hungarian',
    'Bosanski': 'Bosnian',
    'svenska': 'Swedish',
    'ελληνικά': 'Greek',
    'Český': 'Czech',
    'Dansk': 'Danish',
    'Nederlands': 'Dutch',
    'עִבְרִית': 'Hebrew',
    'American English': 'English',
    'Türkçe': 'Turkish',
    'Tagalog': 'Filipino',
    'Khmer': 'Cambodian',
    'Hindi': 'Indian',
    'Tamil': 'Indian',
    'Telugu': 'Indian',
    'Urdu': 'Indian',
    'Oriya': 'Indian',
    'Eesti': 'Estonian',
    'Română': 'Romanian',
    'Romani': 'Romanian',
    'Norsk': 'Norwegian',
    'No': 'Norwegian',
    'Íslenska': 'Icelandic',
    'Bahasa indonesia': 'Indonesian',
    'Català': 'Spanish',
    'Inuktitut': 'Inuit',
    'Hakka': 'Chinese',
    'Sicilian': 'Italian',
    'Marathi': 'Indian',
    'Hrvatski': 'Croatian',
    'shqip': 'Albanian',
    'isiZulu': 'Zulu', 
    'Latviešu': 'Latvian',
    'ქართული': 'Georgian',
    'Australian English': 'English',
    'Bahasamelayu': 'Malay',
    'Lietuvi\x9akai': 'Lithuanian',  # \x9a is a stray control character left by a bad encoding
    'Farsi, Western': 'Persian',
    'فارسی': 'Persian',
    'беларуская мова': 'Belarusian',
    'български език': 'Bulgarian',
    'Swiss German': 'German',
    'Brazilian Portuguese': 'Portuguese',
    'euskera': 'Basque',
    'қазақ': 'Kazakh',
    'Bahasa melayu': 'Malay',
    'French Sign': 'Sign Language',
    'American Sign': 'Sign Language',
    'Hokkien': 'Chinese',
    'Min Nan': 'Chinese',
    'Chinese, Hakka': 'Chinese',
    'Ancient Greek': 'Greek',
    'Gaelic': 'Scottish Gaelic',
    'Scottish Gaelic': 'Scottish Gaelic',
    'Zulu': 'Zulu',
    'Lithuanian': 'Lithuanian',
    'Standard Tibetan': 'Tibetan',
    'Saami, North': 'Sami',
    'Bamanankan': 'Bambara',
    'Fulfulde, Adamawa': 'Fula',
    'South African English': 'English',
    'Jamaican Creole English': 'Jamaican Creole',
    'Classical Arabic': 'Arabic',
    'Frisian, Western': 'Frisian',
    'Yolngu Matha': 'Yolngu Matha',
    'Scanian': 'Swedish',
    'Palawa kani': 'Palawa kani',
    'Kiswahili': 'Swahili',
    'Māori': 'Maori',
    'বাংলা': 'Bengali',
    'తెలుగు': 'Indian',
    'Taiwanese': 'Chinese',
    'Shanghainese': 'Chinese',
    'Azərbaycan': 'Azerbaijani',
    'Cymraeg': 'Welsh',
    'Hariyani': 'Indian',
    'Slovenščina': 'Slovenian',
    'Maya, Yucatán': 'Maya',
    'Egyptian Arabic': 'Arabic',
    'Assyrian Neo-Aramaic': 'Aramaic',
    'Crow': 'Native American languages',
    'Cheyenne': 'Native American languages',
    'Hopi': 'Native American languages',
    'Pawnee': 'Native American languages',
    'Mohawk': 'Native American languages',
    'Algonquin': 'Native American languages',
    'Cree': 'Native American languages',
    'Navajo': 'Native American languages',
    'Sioux': 'Native American languages',
    'Khmer, Central': 'Cambodian'
}


In [11]:
print(len(movies['languages'].explode().unique()))

83


In [12]:
def remove_language_suffix(language_set):
    """Strip the word "language(s)" and stray quotes or backslashes from each entry."""
    if isinstance(language_set, float):  # NaN marks a missing value
        return np.nan
    return {
        re.sub(
            r'[\\\"\']', '',  # Remove backslashes and quotes
            re.sub(r'\blanguages?\b', '', lang, flags=re.IGNORECASE),  # Remove "language"/"languages"
        ).strip()
        for lang in language_set
    }

movies['languages'] = movies['languages'].apply(remove_language_suffix)

movies['languages'] = movies['languages'].apply(lambda x: 
    set([languages_translation.get(string, string) for string in x]) if isinstance(x, set) else x)

print(len(movies['languages'].explode().unique()))
movies['languages'].explode().unique()

61


array([nan, 'Italian', 'German', 'English', 'French', 'Czech',
       'Portuguese', 'Indian', 'Chinese', 'Russian', 'Norwegian',
       'Spanish', 'Japanese', 'Hungarian', 'Polish', 'Danish', 'Korean',
       'Romanian', 'Vietnamese', 'Latin', 'Dutch', '', 'Swedish',
       'Latvian', 'Greek', 'Bengali', 'Ukrainian', 'Hebrew', 'Albanian',
       'Finnish', 'Persian', 'Azerbaijani', 'Arabic', 'Georgian',
       'Estonian', 'Turkish', 'Icelandic', 'Croatian', 'Serbian', 'Welsh',
       'Slovak', 'Wolof', 'Afrikaans', 'Belarusian', 'Slovenian',
       'Fulfulde', 'Bosnian', 'Bulgarian', 'Basque', 'Thai', 'Swahili',
       'Gaeilge', 'Lithuanian', 'Esperanto', 'Kazakh', 'Malay',
       'Indonesian', 'Malti', 'Zulu', 'Bambara', '??????'], dtype=object)

In [13]:
movies['languages'] = movies['languages'].apply(lambda x: 
                                                [lang for lang in x if lang != '' and pd.notna(lang) and lang != '??????']
                                                if isinstance(x, set) else x)
print(len(movies['languages'].explode().unique()))
movies['languages'].explode().unique()

59


array([nan, 'Italian', 'German', 'English', 'French', 'Czech',
       'Portuguese', 'Indian', 'Chinese', 'Russian', 'Norwegian',
       'Spanish', 'Japanese', 'Hungarian', 'Polish', 'Danish', 'Korean',
       'Romanian', 'Vietnamese', 'Latin', 'Dutch', 'Swedish', 'Latvian',
       'Greek', 'Bengali', 'Ukrainian', 'Hebrew', 'Albanian', 'Finnish',
       'Persian', 'Azerbaijani', 'Arabic', 'Georgian', 'Estonian',
       'Turkish', 'Icelandic', 'Croatian', 'Serbian', 'Welsh', 'Slovak',
       'Wolof', 'Afrikaans', 'Belarusian', 'Slovenian', 'Fulfulde',
       'Bosnian', 'Bulgarian', 'Basque', 'Thai', 'Swahili', 'Gaeilge',
       'Lithuanian', 'Esperanto', 'Kazakh', 'Malay', 'Indonesian',
       'Malti', 'Zulu', 'Bambara'], dtype=object)
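
The language clean-up in the cells above chains three steps: strip the word "language(s)" and stray quotes, map each name through a translation table, then drop empty or unreadable entries. A compact sketch of the same chain on toy data follows. The short table `toy_translation` and the function `clean_languages` are hypothetical names introduced only for this illustration.

```python
import re

# A much shorter, hypothetical stand-in for languages_translation
toy_translation = {"Deutsch": "German", "Italiano": "Italian", "Hindi": "Indian"}


def clean_languages(languages, translation):
    """Normalise a list or set of language names; leave missing values untouched."""
    if not isinstance(languages, (list, set)):
        return languages
    cleaned = []
    for lang in languages:
        # Step 1: remove the word "language"/"languages", then quotes and backslashes
        lang = re.sub(r"\blanguages?\b", "", lang, flags=re.IGNORECASE)
        lang = re.sub(r"[\\\"']", "", lang).strip()
        # Step 2: translate to the English name when a translation is known
        lang = translation.get(lang, lang)
        # Step 3: skip empty strings, placeholders and duplicates
        if lang and lang != "??????" and lang not in cleaned:
            cleaned.append(lang)
    return cleaned


print(clean_languages(["Deutsch", "German Language", "'Hindi'", ""], toy_translation))
# ['German', 'Indian']
```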

In [14]:
movies.head()

Unnamed: 0,wikipedia_id,freebase_id,title,languages,countries,genres,keywords,release_date,runtime,plot_summary,cold_war_side,character_western_bloc_representation,character_eastern_bloc_representation,western_bloc_values,eastern_bloc_values,theme
0,4213160.0,/m/0bq8q8,$,,[Russia],"[Comedy, Crime, Drama]",,1971,119.0,"Set in Hamburg, West Germany, several criminal...",Western,"[Joe Collins, American bank security consultan...","[Dawn Divine, hooker with a heart of gold, cun...",[None],"[Resourcefulness, cleverness, individualism, h...",[None]
1,,,"$1,000 on the Black","[Italian, German]","[Germany, Italy]",[Western],,1966,104.0,Johnny Liston has just been released from pris...,Eastern,[None],"[Sartana, villainous, oppressive, cruel, arche...","[Johnny Liston, justice, determination, resili...","[Justice, revenge, oppressed vs. oppressor, re...","[Terror, betrayal, familial conflict, crime, r..."
2,,,"$10,000 Blood Money",,[Russia],"[Western, Drama]",,1967,,Hired by a Mexican landowner to rescue his dau...,,[None],[None],[None],[None],"[crime, betrayal, revenge, bounty hunter, heis..."
3,,,"$100,000 for Ringo",[Italian],[Italy],"[Western, Drama]","[spaghetti western, whipping]",1965,98.0,A stranger rides into Rainbow Valley where he'...,,[None],[None],[None],[None],"[Western, Civil War, mistaken identity, treasu..."
4,,,'Anna' i wampir,,[Russia],[Crime],,1982,,"Silesia in Poland, late 60s. Bodies of vicious...",,[None],[None],[None],[None],"[murder mystery, horror, fog, Poland, 1960s]"


In [15]:
# Drop columns that are not needed downstream
movies = movies.drop(columns=['wikipedia_id', 'freebase_id', 'keywords', 'runtime', 'plot_summary'])
movies.head()

Unnamed: 0,title,languages,countries,genres,release_date,cold_war_side,character_western_bloc_representation,character_eastern_bloc_representation,western_bloc_values,eastern_bloc_values,theme
0,$,,[Russia],"[Comedy, Crime, Drama]",1971,Western,"[Joe Collins, American bank security consultan...","[Dawn Divine, hooker with a heart of gold, cun...",[None],"[Resourcefulness, cleverness, individualism, h...",[None]
1,"$1,000 on the Black","[Italian, German]","[Germany, Italy]",[Western],1966,Eastern,[None],"[Sartana, villainous, oppressive, cruel, arche...","[Johnny Liston, justice, determination, resili...","[Justice, revenge, oppressed vs. oppressor, re...","[Terror, betrayal, familial conflict, crime, r..."
2,"$10,000 Blood Money",,[Russia],"[Western, Drama]",1967,,[None],[None],[None],[None],"[crime, betrayal, revenge, bounty hunter, heis..."
3,"$100,000 for Ringo",[Italian],[Italy],"[Western, Drama]",1965,,[None],[None],[None],[None],"[Western, Civil War, mistaken identity, treasu..."
4,'Anna' i wampir,,[Russia],[Crime],1982,,[None],[None],[None],[None],"[murder mystery, horror, fog, Poland, 1960s]"


In [16]:
movies["cold_war_side"] = movies["cold_war_side"].apply(lambda x: f'"{x}"')
movies.to_csv(PREPROCESSED_MOVIES, index=False)
movies["cold_war_side"].value_counts()

cold_war_side
"None"       19560
"Western"     3142
"Eastern"     2919
Name: count, dtype: int64