# IMDB Data Generator
This is a script for generating modified data from the IMDB `title.basics` file.

## Instructions

1. Visit the [IMDB dataset website](https://datasets.imdbws.com) and download the latest `title.basics.tsv.gz` file.
2. Extract the file and move it (`title.basics.tsv`) to the folder containing this script.
3. Define your preferences in the Preferences section below.
4. Run all the cells of this script.
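
Step 2 can also be done from Python instead of an archive tool. The following is a minimal sketch using only the standard library; the helper name `extract_gz` is hypothetical, and it assumes the downloaded archive sits in the current folder.

```python
import gzip
import shutil
from pathlib import Path


def extract_gz(src, dst):
    """Decompress the gzip archive at src into the file dst and return dst as a Path."""
    src, dst = Path(src), Path(dst)
    # Stream the data in chunks so the large IMDB file is never fully held in memory
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst


# Assumption: the archive was downloaded to the folder this script runs in
archive = Path("title.basics.tsv.gz")
if archive.exists():
    extract_gz(archive, "title.basics.tsv")
```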

The script produces the following tab-separated files:

* Modified dataset: `modified.csv`
* Original dataset: `original.csv`

## Libraries
The following cells load the `title.basics.tsv` file and define the helper functions this script needs.

**Warning:** Wait until the `Done reading!` line is displayed before running the rest of the script.

In [211]:
import pandas as pd

print("Reading file ...")
df = pd.read_csv('title.basics.tsv', delimiter='\t', encoding='utf-8-sig', low_memory=False)
print("Done reading!")

Reading file ...
Done reading!


In [212]:
import random
from string import ascii_letters

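def modify_str(s, n=3):
    """Replace n randomly chosen non-whitespace characters of s with random ASCII letters."""
    # Positions of all non-whitespace characters (candidates for replacement)
    inds = [i for i, c in enumerate(s) if not c.isspace()]

    # Leave strings with fewer than n candidates unchanged
    if len(inds) < n:
        return s

    sam = random.sample(inds, n)

    # n distinct replacement letters, one per sampled position
    letts = iter(random.sample(ascii_letters, n))
    lst = list(s)
    for ind in sam:
        lst[ind] = next(letts)

    return "".join(lst)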
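
To see what `modify_str` does to a single title, here is a small seeded sketch. It repeats the function with a per-character whitespace check so that it runs on its own. The helper `changed_positions`, the seed, and the sample title are assumptions added for illustration only.

```python
import random
from string import ascii_letters


def modify_str(s, n=3):
    # Same idea as the cell above, checking each character for whitespace
    inds = [i for i, c in enumerate(s) if not c.isspace()]
    if len(inds) < n:
        return s
    letts = iter(random.sample(ascii_letters, n))
    lst = list(s)
    for ind in random.sample(inds, n):
        lst[ind] = next(letts)
    return "".join(lst)


def changed_positions(a, b):
    """Indexes at which two equal-length strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


random.seed(0)  # assumption: fixed seed, only to make the demo repeatable
title = "The Great Train Robbery"  # illustrative sample title
noisy = modify_str(title)

print(noisy)
print(changed_positions(title, noisy))
```

The result always has the same length as the input, and whitespace is never replaced. At most `n` characters differ, because a random replacement letter can coincide with the character it replaces. Strings with fewer than `n` non-whitespace characters are returned unchanged.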

## Preferences
In this section you can set the following properties:

* The columns to select from the `tsv` file
* The number of entries the output files should have
* The fraction of those entries that should be modified (e.g. 20% -> `dat_mod_quote = 0.2`)

In [233]:
data = df[['primaryTitle', 'originalTitle', 'titleType','startYear', 'genres']]
data_size = 10000
dat_mod_quote = 0.2
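
As a quick sanity check on these settings, the sketch below computes how many rows will be altered, using the same `int(data_size * dat_mod_quote)` expression as the modification step. The function name `rows_to_modify` and the validation rules are assumptions, not part of the original script.

```python
def rows_to_modify(data_size, dat_mod_quote):
    """Number of sampled rows that will be altered: int(data_size * dat_mod_quote)."""
    if data_size <= 0:
        raise ValueError("data_size must be positive")
    if not 0 <= dat_mod_quote <= 1:
        raise ValueError("dat_mod_quote must be between 0 and 1")
    # int() truncates toward zero, so fractional results are rounded down
    return int(data_size * dat_mod_quote)


print(rows_to_modify(10000, 0.2))  # → 2000
```

Note that `data_size` also cannot exceed the number of rows in the loaded file, since the sampling step draws rows without replacement.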

## Script Operations

### 1. Sample Generation

In [234]:
modified = data.sample(data_size)
original = modified.copy()
indexes = modified.index.values
columns = modified.columns.values

### 2. Data Modification

In [235]:
to_modify = modified.sample(int(data_size * dat_mod_quote))
to_modify_indexes = to_modify.index.values

In [236]:
for index in to_modify_indexes:
    modified.at[index, 'primaryTitle'] = modify_str(modified.at[index, 'primaryTitle'])
    modified.at[index, 'originalTitle'] = modify_str(modified.at[index, 'originalTitle'])
    modified.at[index, 'titleType'] = modify_str(modified.at[index, 'titleType'], 1)

### 3. Output

In [237]:
print("Generating output files...")
modified.to_csv('modified.csv', sep='\t')
original.to_csv('original.csv', sep='\t')
print("Files generated!")

Generating output files...
Files generated!
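
To confirm the result, the two output files can be compared row by row. The following is a minimal sketch using the standard `csv` module; the helper name `count_changed_rows` is hypothetical. It relies on both files being tab-separated (despite the `.csv` extension) and listing the same rows in the same order.

```python
import csv
from pathlib import Path


def count_changed_rows(original_path, modified_path):
    """Count rows that differ between two tab-separated files with the same row order."""
    with open(original_path, newline="", encoding="utf-8") as f_orig, \
         open(modified_path, newline="", encoding="utf-8") as f_mod:
        orig_rows = csv.reader(f_orig, delimiter="\t")
        mod_rows = csv.reader(f_mod, delimiter="\t")
        # zip pairs the rows by position; both files come from the same sample
        return sum(1 for a, b in zip(orig_rows, mod_rows) if a != b)


# Only runs when the script above has already produced both files
if Path("original.csv").exists() and Path("modified.csv").exists():
    print(count_changed_rows("original.csv", "modified.csv"))
```

With `data_size = 10000` and `dat_mod_quote = 0.2`, the count should be at most 2000 and typically very close to it. A selected row can stay identical only when every random replacement letter happens to match the character it replaces.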
