Data sourced from https://www.kaggle.com/datasets/bharatnatrayn/movies-dataset-for-feature-extracion-prediction/

### How can we detect problems with the data?
- Sample the data using `.head()` and look for problems
- Count `NaN` values
- Use `.value_counts()` to find placeholder values that stand in for missing data without being `NaN`

In [23]:
import pandas

df = pandas.read_csv("./1.0.csv")
df.head()

Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
0,Blood Red Sky,(2021),"\nAction, Horror, Thriller",6.1,\nA woman with a mysterious illness is forced ...,\n Director:\nPeter Thorwarth\n| \n Star...,21062.0,121.0,
1,Masters of the Universe: Revelation,(2021– ),"\nAnimation, Action, Adventure",5.0,\nThe war for Eternia begins again in what may...,"\n \n Stars:\nChris Wood, \nSara...",17870.0,25.0,
2,The Walking Dead,(2010–2022),"\nDrama, Horror, Thriller",8.2,\nSheriff Deputy Rick Grimes wakes up from a c...,"\n \n Stars:\nAndrew Lincoln, \n...",885805.0,44.0,
3,Rick and Morty,(2013– ),"\nAnimation, Adventure, Comedy",9.2,\nAn animated series that follows the exploits...,"\n \n Stars:\nJustin Roiland, \n...",414849.0,23.0,
4,Army of Thieves,(2021),"\nAction, Crime, Horror",,"\nA prequel, set before the events of Army of ...",\n Director:\nMatthias Schweighöfer\n| \n ...,,,


In [24]:
print("Fraction of non-null values")
df.count() / df.shape[0]

Fraction of non-null values


MOVIES      1.000000
YEAR        0.935594
GENRE       0.991999
RATING      0.817982
ONE-LINE    1.000000
STARS       1.000000
VOTES       0.817982
RunTime     0.704170
Gross       0.046005
dtype: float64

In [25]:
for column in df.columns:
    print(f"---- {column} ---")
    print(df[column].value_counts().head(5))

---- MOVIES ---
 Bleach: Burîchi                         65
 Mighty Little Bheem                     64
 Avatar: The Last Airbender              61
 La Reina de Indias y el Conquistador    60
 Dexter                                  48
Name: MOVIES, dtype: int64
---- YEAR ---
(2020– )    892
(2021– )    658
(2020)      639
(2019– )    549
(2019)      544
Name: YEAR, dtype: int64
---- GENRE ---
\nComedy                                      852
\nAnimation, Action, Adventure                693
\nDrama                                       562
\nDocumentary                                 498
\nCrime, Drama, Mystery                       336
Name: GENRE, dtype: int64
---- RATING ---
7.2    331
7.6    309
7.5    309
7.4    300
7.3    299
Name: RATING, dtype: int64
---- ONE-LINE ---
\nAdd a Plot\n                                                                                                                                     1265
\nWith kindness, curiosity and childlike wonder, five best 

| What problems should we worry about?                                                | What can we do about these problems?                                                                                                                                                            |
| ----------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Useless index                                                                       | Make the MOVIES field the index                                                                                                                                                                 |
| Many duplicate values in `MOVIES`                                                   | Drop rows that are completely identical, then compare the remaining duplicates                                                                                                                    |
| Start and end years are stored in one string field                                  | Parse with RegEx <br>- split into a start and end field <br>- leave end as NaN if still running <br>- leave both as NaN if source is NaN <br>- correctly handle shows that only lasted one year |
| `GENRE` field is oddly formatted                                                    | Trim & split on `, `                                                                                                                                                                            |
| NaN values in `VOTES`                                                               | Default `VOTES` to `0`                                                                                                                                                                          |
| Extraneous whitespace in `ONE-LINE`                                                 | Trim field                                                                                                                                                                                      |
| `\nAdd a Plot\n` as a default value for `ONE-LINE`                                  | Replace with NaN                                                                                                                                                                                |
| Confusing field name for `ONE-LINE`                                                 | Manually rename to `description`                                                                                                                                                                |
| Many fields combined into the `STARS` field                                         | Parse with string operators, split into `directors` and `stars` fields                                                                                                                          |
| `Gross` field is mostly NaN                                                         | Drop the column, since about 95% of its values are missing                                                                                                                                      |
| Inconsistent capitalization in field names (`SCREAMING-KEBAB-CASE` vs `PascalCase`) | Manually rename all to use `snake_case`                                                                                                                                                         |
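
The fixes in the table could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the helper names (`parse_years`, `split_credits`, `clean_movies`) and the output column names (`start_year`, `end_year`, `genres`, `votes`, `run_time`, and so on) are hypothetical, and the regular expression and the `Director: ... | Stars: ...` layout are assumptions based only on the sample rows shown by `.head()`. Other shapes in the full file would need extra handling, and the step of comparing the remaining duplicate titles is left out.

```python
import re

import numpy as np
import pandas as pd

# Assumed YEAR shapes, taken from the sample rows: "(2021)", "(2013– )", "(2010–2022)".
YEAR_PATTERN = re.compile(
    r"\((?P<start>\d{4})(?P<dash>\s*[–-]\s*)?(?P<end>\d{4})?\s*\)"
)


def parse_years(value):
    """Return (start_year, end_year) as floats; NaN marks an unknown year."""
    if pd.isna(value):
        return (np.nan, np.nan)
    match = YEAR_PATTERN.search(str(value))
    if match is None:
        return (np.nan, np.nan)
    start = float(match["start"])
    if match["dash"] is None:
        # "(2021)": no dash, so the title started and ended in the same year.
        return (start, start)
    # "(2013– )" has a dash but no end year, so the title is still running.
    end = float(match["end"]) if match["end"] else np.nan
    return (start, end)


def split_credits(value):
    """Split a raw STARS cell into (directors, stars) lists of names."""
    directors, stars = [], []
    if pd.isna(value):
        return directors, stars
    # Assumed layout: "Director: A | Stars: B, C", padded with newlines and spaces.
    for section in str(value).split("|"):
        label, _, names = section.partition(":")
        people = [name.strip() for name in names.split(",") if name.strip()]
        label = label.strip().lower()
        if label.startswith("director"):
            directors = people
        elif label.startswith("star"):
            stars = people
    return directors, stars


def clean_movies(raw):
    """Apply the fixes from the table to a frame shaped like the raw CSV."""
    # Drop the sparse column and exact duplicates before adding list-valued
    # columns, because lists are unhashable and would break drop_duplicates().
    df = raw.drop(columns=["Gross"]).drop_duplicates().copy()

    years = df["YEAR"].map(parse_years)
    df["start_year"] = years.map(lambda pair: pair[0])
    df["end_year"] = years.map(lambda pair: pair[1])

    credit_pairs = df["STARS"].map(split_credits)
    df["directors"] = credit_pairs.map(lambda pair: pair[0])
    df["stars"] = credit_pairs.map(lambda pair: pair[1])

    df["genres"] = df["GENRE"].str.strip().str.split(", ")
    df["votes"] = df["VOTES"].fillna(0)

    # Trim the description, then turn the "Add a Plot" placeholder into NaN.
    description = df["ONE-LINE"].str.strip()
    df["description"] = description.mask(description == "Add a Plot")

    df = df.drop(columns=["YEAR", "STARS", "GENRE", "VOTES", "ONE-LINE"])
    df = df.rename(
        columns={"MOVIES": "movies", "RATING": "rating", "RunTime": "run_time"}
    )
    return df.set_index("movies")
```

With these helpers, `parse_years("(2010–2022)")` returns `(2010.0, 2022.0)`, and `clean_movies(df)` returns a new frame indexed by title without modifying `df`. Duplicates are dropped before the list-valued columns are added because `drop_duplicates()` cannot hash lists.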
