# Project

## Author: 16923

In [1]:
import nltk
import pandas as pd
import numpy as np
import random
import datetime
from user import User
import favouriteplaylist
# The ideal solutions from PS 9 have been altered and used for this problem set.
from streaming_service import StreamingService
from show import Show
import playlist
from cleaner import clean_text
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore') # Warning messages have been checked and are not relevant. This does not stop errors being generated.

# Intro

## What am I going to do in this project?
I will complete three tasks in Python using clean and efficient coding. The tasks are:  
1) Create a class User, who will belong to a streaming service. Create a class for a favourite playlist, with inherited properties from playlist.  
2) Develop a more sophisticated searching method using an inverted index.  
3) Perform an analysis of the language used in Netflix titles and descriptions.

# Task 1

All the required parts of the classes User and Favourite have been made; I demonstrate them below. Note that neither of these special playlists has a capacity option: being special playlists, they do not need one.

## How does it work?

In [2]:
a = Show('b','sfdg','adsfg','c',1234,'d','e','1','f','qewr')
z = Show('z','234f','23df','h',1567,'p','t','234f','w','gasd')

# User takes the user's birthday as arguments: dd, mm, yyyy
me = User(19,7,1999)
print("-------- Watch Later --------")
# Watch later playlist
me.watch_later(a) # Add a show with watch_later(show)
me.watch_later(z)
me.play_watch_later('b') # Play a show with play_watch_later(show_title); if show_title is left blank, the next show in line is played.
me.dont_watch_later('z') # Remove a show from watch later with dont_watch_later(show_title)
# Once a show is played it should leave the playlist
print(me.get_watch_later())
print("--------- Favourite ---------")
# Favourite Playlist
# These work identically to watch_later, except that a show is not removed when it is played
me.favourite(a)
me.favourite(z)
me.play_favourite('b') # If no show is named, it will play a random show from the list.
me.unfavourite('z')
print(me.get_favourites())
 
print("--------- Playlist Interactions ---------")
# If a show is played from favourites that is also in watch later, it is removed from watch later.
me.watch_later(a)
me.play_favourite('b')
print('watch later:', me.get_watch_later())
print('favourite:', me.get_favourites())

print("--------- Other features ---------")
me.get_age() # Returns the user's age. This uses the current date to make sure it's accurate
me.get_history() # Returns all the shows that have been played (ignoring duplicates)
me.clear_history() # Removes all items from the user's history.
me.get_history()

-------- Watch Later --------
Playing the show b (1)
{}
--------- Favourite ---------
Playing the show b (1)
{'b': b (1234)}
--------- Playlist Interactions ---------
Playing the show b (1)
watch later: {}
favourite: {'b': b (1234)}
--------- Other features ---------
Your history has been cleared


{}
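
The behaviour demonstrated above can be summarised in a minimal sketch. This is not the code in user.py: `SketchUser`, its attribute names and the optional `today` argument are hypothetical, and shows are represented by plain title strings rather than Show objects.

```python
import datetime
import random


class SketchUser:
    """Hypothetical stand-in for User; shows are plain title strings here."""

    def __init__(self, day, month, year):
        self.birthday = datetime.date(year, month, day)
        self.later = {}        # watch later: insertion order is the queue order
        self.favourites = {}
        self.history = {}

    def get_age(self, today=None):
        # `today` is an assumed optional argument that makes the result testable
        today = today or datetime.date.today()
        had_birthday = (today.month, today.day) >= (self.birthday.month, self.birthday.day)
        return today.year - self.birthday.year - (0 if had_birthday else 1)

    def watch_later(self, title):
        self.later[title] = title

    def favourite(self, title):
        self.favourites[title] = title

    def play_watch_later(self, title=None):
        if not self.later:
            raise IndexError("The watch later playlist is empty")
        if title is None:
            title = next(iter(self.later))   # next show in line
        show = self.later.pop(title)         # a played show leaves watch later
        self.history[title] = show
        return show

    def play_favourite(self, title=None):
        if not self.favourites:
            raise IndexError("The favourite playlist is empty")
        if title is None:
            title = random.choice(list(self.favourites))
        show = self.favourites[title]        # a played show stays in favourites
        self.later.pop(title, None)          # but is removed from watch later
        self.history[title] = show
        return show


me = SketchUser(19, 7, 1999)
me.watch_later('b')
me.favourite('b')
me.play_favourite('b')
print(me.later, me.favourites)
```

Using dictionaries keyed by title keeps removal by name simple, and because dictionaries preserve insertion order, the first key doubles as the "next in line" show.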

## Test

Using try and except statements I shall test my code. If this chunk runs to the end, it behaves as expected.

In [3]:
me = User(19,7,1999)

# Add a show to watch later
try:
    me.watch_later(a)
except:
    assert False, f"Attempted to add show to watch_later but an error occurred."

# Play the show from watch later
played_show = str(me.play_watch_later())
assert played_show =="b (1234)", f"Expected the string b (1234) but {played_show} is given"

# We should fail to play a show now as there's no show to play
try:
    me.play_watch_later() 
except IndexError:
    pass
except:
    assert False, f"Play a show from an empty playlist: expected to raise IndexError but another errortype is raised."
else:
    assert False, f"Play a show from an empty playlist: expected to raise IndexError but no error is raised."

# Add a show to watch later
try:
    me.watch_later(a)
except:
    assert False, f"Attempted to add show to watch_later but an error occurred."

# Remove the show from watch later
try:
    me.dont_watch_later('b')
except:
    assert False, 'An error has occurred'

# A show shouldn't play now as no show is in watch later
try:
    played_show = me.play_watch_later()
except IndexError:
    pass
except:
    assert False, f"Play a show by name that is not available: expected to raise IndexError but another errortype is raised."
else:
    assert False, f"Play a show by name that is not available: expected to raise IndexError but no error is raised."

# Add a show to favourites
try:
    me.favourite(a)
except:
    assert False, f"Attempted to add show to favourite but an error occurred."

# Add a second show to favourites
try:
    me.favourite(z)
except:
    assert False, f"Attempted to add show to favourite but an error occurred."

# Remove a show from favourites
try:
    me.unfavourite('b')
except:
    assert False, 'An error has occurred'

# Add a show to watch later
try:
    me.watch_later(z)
except:
    assert False, f"Attempted to add show to watch_later but an error occurred."

# Currently one show is in favourite and watch_later (the same show)
# Playing a show from favourite should remove it from watch later but not from favourites
played_show = str(me.play_favourite()) # Only z is in favourite; it is also in watch_later, from which it should be removed.
assert played_show =="z (1567)", f"Expected the string z (1567) but {played_show} is given"

# Try and play a show from watch later, it shouldn't run
try:
    played_show = me.play_watch_later()
except IndexError:
    pass
except:
    assert False, f"Play a show by name that is not available: expected to raise IndexError but another errortype is raised."
else:
    assert False, f"Play a show by name that is not available: expected to raise IndexError but no error is raised."

# Play the same show again from favourites.
played_show = str(me.play_favourite())
assert played_show =="z (1567)", f"Expected the string z (1567) but {played_show} is given"

# Get the age of the user
# Note: the expected value depends on the date on which the test is run
age = (me.get_age())
assert age == 21, f"Expected 21 but {age} is given"

# Now to test history
me = User(22,3,1962)

# We've just made a new user, so the history should be blank
hist = me.get_history()
assert hist == {}, f"Expected empty dictionary but {hist} given"

# Add a show to watch later
try:
    me.watch_later(a)
except:
    assert False, f"Attempted to add show to watch_later but an error occurred."

# Play the show from watch later
played_show = str(me.play_watch_later())
assert played_show =="b (1234)", f"Expected the string b (1234) but {played_show} is given"

# Check the show has been added to the history
hist = (me.get_history())
assert list(hist.keys()) == ['b'], f"Expected dictionary with show b (1234) but {hist} given"

# Add show to favourite
try:
    me.favourite(z)
except:
    assert False, f"Attempted to add show to favourite but an error occurred."

# Play the show from favourites
played_show = str(me.play_favourite())
assert played_show =="z (1567)", f"Expected the string z (1567) but {played_show} is given"

# Check it's been added to history
hist = (me.get_history())
assert list(hist.keys()) == ['b', 'z'], f"Expected dictionary with shows 'b' and 'z' as keys but {hist} given"

# Clear the history
try:
    me.clear_history()
except:
    assert False, "An unexpected error has arisen"

# Check the history is empty
hist = me.get_history()
assert hist == {}, f"Expected empty dictionary but {hist} given"


Playing the show b (1)
Playing the show z (234f)
Playing the show z (234f)
Playing the show b (1)
Playing the show z (234f)
Your history has been cleared


As the code runs to the end without error, the test is successful.

# Task 2

## Program design

My inverted index is built around a class. I chose to use a class rather than a function as it gives the ability to add to the index after it has been made, and to return the index without rebuilding it when it's needed. You would need to make numerous functions to do the same thing; this would be untidy and not very user friendly. A class also allows for other streaming services to be loaded in with ease. 

My inverted index makes use of a dictionary. A dictionary is a kind of hash table, here with the words as the keys, so accessing a word takes constant time, O(1), on average. This is preferred over other techniques such as a list of lists of the form [[word, [indexes]], [word_2, [indexes]], ...], as it would take O(n) time to cycle through such a list to check whether a word is present.

In the dictionary, each word corresponds to a list of values. Finding a value in one of these lists has worst-case running time O(n), which occurs when every show contains the same word. However, the average running time will be much lower: all the stop words have been removed, so no single word will appear in nearly all the shows.

I cleaned my text with a function that I made. Because it is a function, it can be called wherever it is needed, and the same code is reused for the search inputs, for question 3, and anywhere else text has to be cleaned.
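
To make the design above concrete, here is a minimal sketch of a dictionary-based inverted index. It is an illustration under assumptions, not the code in streaming_service.py or cleaner.py: `simple_clean`, `build_index` and the tiny `STOP_WORDS` set are hypothetical names, and the cleaning is far simpler than the real `clean_text` (there is no stemming, for instance).

```python
import string

STOP_WORDS = {"the", "a", "an", "of", "and", "in"}  # tiny illustrative list


def simple_clean(text):
    """Lower-case the text, strip punctuation and drop stop words."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in stripped.split() if word not in STOP_WORDS]


def build_index(texts):
    """Map each cleaned word to the positions of the texts that contain it."""
    index = {}
    for position, text in enumerate(texts):
        for word in set(simple_clean(text)):   # set(): list each text once per word
            index.setdefault(word, []).append(position)
    return index


shows = ["The Chess Prodigy", "Time Travel, in Love", "A chess champion!"]
index = build_index(shows)
print(index["chess"])   # [0, 2]
print("the" in index)   # False
```

Looking up `index["chess"]` is a single hash lookup, whereas the list-of-lists alternative would have to be scanned word by word.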

## Implementation

In [4]:
netflix_data_raw = pd.read_csv('../data/netflix_titles.csv')
netflix_data = netflix_data_raw[['title', 'director', 'cast', 'country', 'type', 'date_added', 'rating', 'duration', 'listed_in', 'description']]
netflix_data.head() # The data has loaded in correctly

Unnamed: 0,title,director,cast,country,type,date_added,rating,duration,listed_in,description
0,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,TV Show,"August 14, 2020",TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,Movie,"December 23, 2016",TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,Movie,"December 20, 2018",R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,Movie,"November 16, 2017",PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,Movie,"January 1, 2020",PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [5]:
netflix = StreamingService('Netflix', netflix_data)

In [6]:
netflix.search('chess')

[Searching for Bobby Fischer (Movie),
 The Queen's Gambit (TV Show),
 Magnus (Movie),
 The Coldest Game (Movie),
 Creating The Queen's Gambit (Movie)]

In [7]:
netflix.search('the - chESs %champion')

[Magnus (Movie)]

In [8]:
netflix.search('tim, %Timed!!! !travelling &movies')

[About Time (Movie)]

I have written an additional optional argument for the search function called 'unsure'. If unsure is set to True, the search returns all the shows that match any word in the given phrase (except stop words). The output is ordered by the number of matches made.
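
The ordering by number of matches can be sketched as follows. This is a hedged illustration rather than the actual search method: `search_any` and `toy_index` are hypothetical, and the index maps words to made-up show positions.

```python
from collections import Counter


def search_any(index, words):
    """Return the positions matching any word, most matches first."""
    matches = Counter()
    for word in words:
        for position in index.get(word, []):
            matches[position] += 1
    # Sort by match count (descending), then by position to keep ties stable
    ranked = sorted(matches.items(), key=lambda item: (-item[1], item[0]))
    return [position for position, _ in ranked]


toy_index = {"time": [0, 1, 2], "travel": [0, 2], "dad": [2, 3]}
print(search_any(toy_index, ["time", "travel", "dad"]))   # [2, 0, 1, 3]
```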

In [9]:
netflix.search('tim, %Timed!!! !travelling &movies dad', True)

[About Time (Movie),
 Big Time Movie (Movie),
 The Time Traveler's Wife (Movie),
 Yo-Kai Watch: The Movie (Movie),
 Anesthesia (Movie),
 Enemy (Movie),
 Garfield's Fun Fest (Movie),
 Hans Zimmer: Live in Prague (Movie),
 The Cakemaker (Movie),
 Tim Allen: Men Are Pigs (Movie),
 Tim Allen: ReWires America (Movie),
 Tim Minchin And The Heritage Orchestra Live (Movie),
 Tim Minchin: So F**king Rock (Movie),
 Abby Sen (Movie),
 Action Replayy (Movie),
 AK vs AK (Movie),
 Awe (Movie),
 InuYasha the Movie: Affections Touching Across Time (Movie),
 Long Time Running (Movie),
 Monster High: Freaky Fusion (Movie),
 Once Upon a Time in Mumbai Dobaara! (Movie),
 Safety Not Guaranteed (Movie),
 Scream 3 (Movie),
 Spirit Riding Free: Spirit of Christmas (Movie),
 The Time Machine (Movie),
 يوم الدين (Movie),
 Blinky Bill: The Movie (Movie),
 Fishtronaut: The Movie (Movie),
 My Travel Buddy (Movie),
 Dad Wanted (Movie),
 Scary Movie 5 (Movie),
 The Larva Island Movie (Movie),
 1 Mile to You (Movie),

If a word has no matches, it is excluded from the search. This reduces the effect of typos and spelling errors.

In [10]:
netflix.search('tyme travel tim')

Excluding 'tyme' from the search as no matches were found


[Extreme Engagement (TV Show), The Cakemaker (Movie), About Time (Movie)]
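
One way to get this behaviour is sketched below, assuming that the default search requires every remaining word to match. `search_all` and `toy_index` are hypothetical names with made-up positions; the real method may differ.

```python
def search_all(index, words):
    """Return the positions that match every word that exists in the index."""
    known = []
    for word in words:
        if word in index:
            known.append(word)
        else:
            print(f"Excluding '{word}' from the search as no matches were found")
    if not known:
        return []
    result = set(index[known[0]])
    for word in known[1:]:
        result &= set(index[word])   # keep only positions shared by every word
    return sorted(result)


toy_index = {"time": [0, 1, 2], "travel": [0, 2], "tim": [2, 5]}
print(search_all(toy_index, ["tyme", "travel", "tim"]))   # [2]
```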

# Task 3
## Data analysis task (b)
### (a)
I'm going to count the words used by release year, to see whether the topics have changed over time. I will group the years so that each group has a sufficient amount of data.  
Groupings:  
since_2010: 2010 - Present  
since_2000: 2000 - 2009    
since_1980: 1980 - 1999  
before_1980: 0 - 1979 

In [11]:
netflix_data_2 = netflix_data_raw[['title', 'director', 'cast','release_year', 'country', 'type', 'date_added', 'rating', 'duration', 'listed_in', 'description']]

In [12]:
def word_count(text, dic):
    """Add the count of each cleaned word in text to dic and return dic."""
    words = clean_text(text)

    for word in words:
        dic[word] = dic.get(word, 0) + 1
    return dic

In [13]:
since_2010 = {}
since_2000 = {}
since_1980 = {}
before_1980 = {}
# Walk through the release years and descriptions together, one show at a time
for year, text in zip(netflix_data_2['release_year'], netflix_data_2['description']):
    if year >= 2010:
        since_2010 = word_count(text, since_2010)
    elif year >= 2000:
        since_2000 = word_count(text, since_2000)
    elif year >= 1980:
        since_1980 = word_count(text, since_1980)
    elif year < 1980:
        before_1980 = word_count(text, before_1980)
    else:
        print(year) # Only reached if the release year is missing (NaN)

In [14]:
sorted(since_2010.items(), key=lambda x: x[1], reverse=True)

[('life', 603),
 ('young', 547),
 ('famili', 528),
 ('find', 513),
 ('new', 511),
 ('friend', 462),
 ('world', 425),
 ('love', 415),
 ('take', 404),
 ('live', 388),
 ('man', 388),
 ('two', 367),
 ('woman', 357),
 ('get', 319),
 ('seri', 312),
 ('documentari', 298),
 ('becom', 293),
 ('help', 279),
 ('must', 270),
 ('stori', 258),
 ('year', 254),
 ('``', 250),
 ('home', 242),
 ('one', 239),
 ('tri', 236),
 ('school', 236),
 ('father', 235),
 ('mysteri', 216),
 ('three', 214),
 ('turn', 213),
 ('teen', 211),
 ('murder', 210),
 ('follow', 210),
 ('group', 202),
 ('set', 201),
 ('forc', 199),
 ('make', 199),
 ('struggl', 196),
 ('learn', 195),
 ('team', 194),
 ('save', 192),
 ('secret', 192),
 ('girl', 192),
 ('meet', 185),
 ('fall', 183),
 ('special', 182),
 ('show', 179),
 ('explor', 178),
 ('face', 177),
 ('student', 176),
 ('mother', 175),
 ('fight', 171),
 ('work', 170),
 ('back', 168),
 ('power', 168),
 ('citi', 163),
 ('high', 163),
 ('way', 162),
 ('adventur', 162),
 ('return', 161

In [15]:
sorted(since_2000.items(), key=lambda x: x[1], reverse=True)

[('young', 66),
 ('find', 63),
 ('life', 58),
 ('love', 55),
 ('friend', 55),
 ('famili', 51),
 ('woman', 46),
 ('two', 46),
 ('man', 46),
 ('world', 43),
 ('becom', 43),
 ('must', 42),
 ('new', 41),
 ('one', 39),
 ('make', 38),
 ('live', 38),
 ('school', 37),
 ('take', 37),
 ('tri', 36),
 ('help', 36),
 ('girl', 35),
 ('seri', 33),
 ('get', 32),
 ('power', 32),
 ('student', 30),
 ('team', 29),
 ('stori', 28),
 ('son', 27),
 ('father', 26),
 ('three', 25),
 ('forc', 25),
 ('brother', 25),
 ('day', 24),
 ('murder', 24),
 ('set', 23),
 ('turn', 23),
 ('home', 23),
 ('group', 22),
 ('true', 22),
 ('dream', 22),
 ('begin', 22),
 ('secret', 22),
 ('daughter', 22),
 ('film', 21),
 ('marri', 21),
 ('work', 21),
 ('high', 21),
 ('four', 21),
 ('come', 21),
 ('discov', 21),
 ('move', 20),
 ('``', 20),
 ('return', 20),
 ('fall', 19),
 ('save', 18),
 ('documentari', 18),
 ("n't", 18),
 ('learn', 18),
 ('follow', 18),
 ('meet', 17),
 ('death', 17),
 ('struggl', 17),
 ('boy', 17),
 ('agent', 17),
 

In [16]:
sorted(since_1980.items(), key=lambda x: x[1], reverse=True)

[('young', 34),
 ('love', 32),
 ('famili', 31),
 ('life', 29),
 ('man', 27),
 ('take', 24),
 ('new', 22),
 ('fall', 22),
 ('woman', 20),
 ('power', 20),
 ('murder', 19),
 ('becom', 19),
 ('find', 19),
 ('must', 18),
 ('help', 17),
 ('turn', 17),
 ('friend', 17),
 ('``', 16),
 ('tri', 16),
 ('two', 15),
 ('brother', 15),
 ('return', 15),
 ('live', 15),
 ('forc', 15),
 ('father', 15),
 ('one', 14),
 ('school', 14),
 ('work', 14),
 ('girl', 14),
 ('get', 13),
 ('three', 13),
 ('set', 12),
 ('boy', 12),
 ('cop', 12),
 ('crimin', 11),
 ('whose', 11),
 ('come', 11),
 ('world', 11),
 ('wealthi', 11),
 ('son', 11),
 ('fight', 11),
 ('marri', 11),
 ('student', 10),
 ('teen', 10),
 ('daughter', 10),
 ('agent', 10),
 ('master', 10),
 ('crime', 10),
 ('seri', 10),
 ('war', 10),
 ('stori', 10),
 ('face', 9),
 ('death', 9),
 ('wife', 9),
 ('evil', 9),
 ('stop', 9),
 ('meet', 9),
 ('show', 9),
 ('teenag', 9),
 ('battl', 9),
 ('team', 9),
 ('grow', 8),
 ('move', 8),
 ('home', 8),
 ('comedi', 8),
 ('ki

In [17]:
sorted(before_1980.items(), key=lambda x: x[1], reverse=True)

[('world', 12),
 ('war', 12),
 ('new', 10),
 ('woman', 9),
 ('find', 9),
 ('film', 8),
 ('fall', 8),
 ('life', 8),
 ('love', 7),
 ('famili', 7),
 ('friend', 7),
 ('tri', 6),
 ('young', 6),
 ('one', 6),
 ('get', 6),
 ('take', 6),
 ('forc', 5),
 ('ii', 5),
 ('father', 5),
 ('wife', 5),
 ('daughter', 5),
 ('girl', 5),
 ('past', 5),
 ('documentari', 5),
 ('offic', 5),
 ('town', 5),
 ('man', 5),
 ('fight', 5),
 ('director', 4),
 ('live', 4),
 ('teen', 4),
 ('american', 4),
 ('plot', 4),
 ('transform', 4),
 ('secret', 4),
 ('danger', 4),
 ('follow', 4),
 ('music', 4),
 ('soon', 4),
 ('lead', 4),
 ('prepar', 4),
 ('return', 4),
 ('hire', 4),
 ('captur', 4),
 ('join', 4),
 ('two', 4),
 ('polic', 4),
 ('time', 4),
 ('nazi', 4),
 ('becom', 4),
 ('drama', 3),
 ('alli', 3),
 ('dark', 3),
 ('grow', 3),
 ('give', 3),
 ('enemi', 3),
 ('leav', 3),
 ('four', 3),
 ('america', 3),
 ('citi', 3),
 ('deadli', 3),
 ('high', 3),
 ('adapt', 3),
 ('reach', 3),
 ('mistress', 3),
 ('children', 3),
 ('professor', 

In each time frame, we see very similar words used most. Since 1980, the lists seem near identical. Before then they are similar, but the lack of data perhaps reduces the accuracy. The words 'young', 'love' and 'life', among many others, appear very frequently, especially since 1980. Before 1980, 'world' and 'war' each occur 12 times, making them the most common words, which suggests that many of the films from this era were based on the world wars. This is backed up by several other war-related words, for example there are 4 occurrences each of 'pilot', 'nazi', 'capture' and 'secret'. Let's look at this further.

In [18]:
potential_war_films = []
for year, text in zip(netflix_data_2['release_year'], netflix_data_2['description']):
    # Substring test: this also matches words that merely contain 'war'
    if year < 1980 and 'war' in text:
        potential_war_films.append(text)

In [19]:
potential_war_films

['This wartime drama details a pivotal day in 1944 when an Allied task force tried to win World War II by seizing control of key bridges in Holland.',
 'In the age of Buddha and his philosophy of nonviolence, a warmonger king plots the destruction of an enemy kingdom to rescue the woman he loves.',
 'Luke Jackson likes to do things his own way, which leads to a world of hurt when he ends up in a prison camp – and on the wrong side of its warden.',
 "Lupin, his sidekick, Jigen, and the samurai warrior Goemon set out to take over an evil counterfeit operation at Count Cagliostro's fortress.",
 'Two close childhood friends take drastically different paths in life, but meet by chance years later and fall in love, unaware of their past bond.',
 'Returning home from war after being assumed dead, a pilot weds the woman he has long loved, unaware that she had been planning to marry his best friend.',
 'Director John Ford captures combat footage of the Battle of Midway, an air and sea campaign 

While not all of these films are war-based (the substring test also matches words such as 'warden' and 'unaware'), many are, suggesting that war films used to make up a higher percentage of films than they do now.
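
A stricter filter would match 'war' only as a whole word. The sketch below is an illustration, not part of the analysis above; `mentions_war` and `WAR_WORD` are hypothetical names, and the sample strings are fragments of the descriptions printed above.

```python
import re

# \b marks a word boundary, so 'warden' and 'unaware' no longer match
WAR_WORD = re.compile(r"\bwar\b", re.IGNORECASE)


def mentions_war(text):
    """Return True if 'war' appears in text as a whole word, in any case."""
    return WAR_WORD.search(text) is not None


print(mentions_war("Returning home from war after being assumed dead"))   # True
print(mentions_war("on the wrong side of its warden"))                    # False
```

Note that this version would also skip compound words such as 'wartime', so the pattern would need extending if those should count.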

### (b)

In [20]:
pair_publication_count = {}
for row in range(len(netflix_data_2)):
    cast = str(netflix_data_2['cast'].iloc[row]).split(', ')
    director = str(netflix_data_2['director'].iloc[row]).split(', ')
    # Remove duplicates (someone may act in and direct the same show) and missing values ('nan')
    crew = [person for person in dict.fromkeys(cast + director) if person != 'nan']
    for person_1 in crew:
        for person_2 in crew:
            # Each pair is counted once and stored in alphabetical order,
            # so the reversed pair can never already be in the dictionary
            if person_1 < person_2:
                pair = (person_1, person_2)
                pair_publication_count[pair] = pair_publication_count.get(pair, 0) + 1

In [21]:
sorted(pair_publication_count.items(), key=lambda x: x[1], reverse=True)[0:25]

[(('Jan Suter', 'Raúl Campos'), 19),
 (('John Paul Tremblay', 'Robb Wells'), 15),
 (('John Cleese', 'Terry Jones'), 14),
 (('John Cleese', 'Michael Palin'), 14),
 (('Eric Idle', 'John Cleese'), 14),
 (('Eric Idle', 'Terry Jones'), 14),
 (('Eric Idle', 'Michael Palin'), 14),
 (('Michael Palin', 'Terry Jones'), 14),
 (('Jamie Watson', 'Michela Luci'), 13),
 (('Eric Peterson', 'Michela Luci'), 13),
 (('Eric Peterson', 'Jamie Watson'), 13),
 (('Alessandro Juliani', 'Vincent Tong'), 13),
 (('Graham Chapman', 'John Cleese'), 13),
 (('Graham Chapman', 'Terry Jones'), 13),
 (('Graham Chapman', 'Michael Palin'), 13),
 (('John Cleese', 'Terry Gilliam'), 13),
 (('Eric Idle', 'Graham Chapman'), 13),
 (('Eric Idle', 'Terry Gilliam'), 13),
 (('Terry Gilliam', 'Terry Jones'), 13),
 (('Michael Palin', 'Terry Gilliam'), 13),
 (('Daisuke Ono', 'Yuki Kaji'), 12),
 (('Diana Kaarina', 'Vincent Tong'), 12),
 (('Anna Claire Bartlam', 'Michela Luci'), 12),
 (('Anna Claire Bartlam', 'Jamie Watson'), 12),
 (('A

We see that many of the people in this list appear more than once. For example, Eric Idle, John Cleese and Michael Palin have all been in 14 shows with each other. This is because they have continually worked together as a group (Monty Python). I'd imagine the same is true of the other groups of names that appear in this list together.
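
The pair counting can also be expressed with the standard library. The following is an alternative sketch on toy data, not the code used to produce the results above; `count_pairs` and `toy_crews` are hypothetical, and the names are placeholders.

```python
from collections import Counter
from itertools import combinations


def count_pairs(crews):
    """Count how many crews each unordered pair of people shares."""
    pair_counts = Counter()
    for crew in crews:
        people = sorted(set(crew))                    # unique names, alphabetical
        pair_counts.update(combinations(people, 2))   # every pair exactly once
    return pair_counts


toy_crews = [["A", "B", "C"], ["B", "A"], ["C"]]
counts = count_pairs(toy_crews)
print(counts.most_common(1))   # [(('A', 'B'), 2)]
```

Sorting the names first means each pair always appears in the same order, so no reversed duplicates can arise.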

### (c)

For my analysis, I will be finding the distribution of film lengths, and then comparing the average film lengths of films from various countries.

In [22]:
movies = netflix_data_2[netflix_data_2.type == 'Movie']

In [23]:
def length_distribution(data):
    # One row for each possible film length in minutes (0 to 319)
    lengths = pd.DataFrame(np.zeros((320, 1)))
    lengths = lengths.rename(columns = {0:'Freq'})
    for i in range(len(data)):
        length = data[['duration']].iloc[i,0]
        length = length.split(' ') # e.g. '93 min' -> ['93', 'min']
        length = int(length[0])
        lengths.iloc[[length]] += 1
    # Operations on the columns of a pandas data frame, such as those below, are vectorised.
    lengths['Density'] = lengths['Freq'] / lengths['Freq'].sum()
    graph = lengths['Density'].plot(color = 'yellow', legend = False, ylabel = "Density", xlabel = "Length (minutes)", title = "Density of movie lengths")
    graph.patch.set_facecolor('blue')
    # Each length's contribution to the mean: length * proportion of films with that length
    lengths['Average contribution'] = lengths['Density'] * lengths.index
    return lengths

In [None]:
all_movies = length_distribution(movies)

In [None]:
all_movies['Average contribution'].sum()

The average (mean) film is 99 minutes long 

In [None]:
print(all_movies['Density'][1:98].sum())
print(all_movies['Density'][1:99].sum())

The median film is 98 minutes long
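
The mean and median above are both read off the frequency table: the mean is the sum of each length multiplied by its share of films, and the median is the first length at which the cumulative share reaches one half. A small sketch of that arithmetic on made-up frequencies follows; `mean_and_median` and `toy_freq` are hypothetical and unrelated to the Netflix data.

```python
def mean_and_median(freq):
    """freq[m] is the number of films lasting m minutes."""
    total = sum(freq)
    mean = sum(minutes * count for minutes, count in enumerate(freq)) / total
    cumulative = 0
    for minutes, count in enumerate(freq):
        cumulative += count
        if cumulative >= total / 2:   # half of the films are this long or shorter
            return mean, minutes


toy_freq = [0, 0, 3, 1, 0, 1]   # three films of 2 minutes, one of 3, one of 5
mean, median = mean_and_median(toy_freq)
print(mean, median)
```

For an even number of films this returns the lower of the two middle values rather than their average.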

The mean is marginally higher than the median. This could be down to a few extreme values (films over 300 minutes, for example). Let's look at these films:

In [None]:
movies = movies.copy() # Work on a copy so that netflix_data_2 is left untouched
movies['duration'] = movies['duration'].str[:-4].astype(int) # Strip the trailing ' min', e.g. '93 min' -> 93
movies.sort_values('duration').iloc[-20:]

The longest item is an episode of Black Mirror. This episode was an 'interactive' movie, which means that the watching time in reality is much lower than the quoted figure. The figure corresponds to the length of all the possible stories combined.

Beyond this we see many of the longer films were produced in Egypt and India.

In [None]:
egypt_films = movies[movies.country == 'Egypt']
len(egypt_films)

In [None]:
egypt_films['duration'].mean()

In [None]:
egypt_films['duration'].median()

As hypothesised, these films have a longer than average length. Let's check the same for India too.

In [None]:
india_films = movies[movies.country == 'India']
len(india_films)

In [None]:
india_films['duration'].mean()

In [None]:
india_films['duration'].median()

These films are even longer than those from Egypt.  
Let's try and find a country with shorter films.

In [None]:
movies.sort_values('duration').iloc[:20]

It appears that the US has shorter films than average.

In [None]:
us_films = movies[movies.country == 'United States']
len(us_films)

In [None]:
us_films['duration'].mean()

In [None]:
us_films['duration'].median()

The films from the US appear to have a shorter than average runtime. An interesting observation is that for every country, and in the full sample, the median length was almost equal to the mean length. The analysis strongly suggests that the country a film is from is correlated with its length.
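
The per-country comparison repeats the same three steps (filter, mean, median) for each country. A hedged sketch of doing this in one pass is given below; `summarise_by_country` and `toy_films` are hypothetical, the countries are placeholders and the numbers are made up rather than taken from the Netflix data.

```python
from collections import defaultdict
from statistics import mean, median


def summarise_by_country(films):
    """films is an iterable of (country, minutes) pairs."""
    by_country = defaultdict(list)
    for country, minutes in films:
        by_country[country].append(minutes)
    # For each country: number of films, mean length, median length
    return {country: (len(lengths), mean(lengths), median(lengths))
            for country, lengths in by_country.items()}


toy_films = [("X", 100), ("X", 120), ("Y", 90), ("Y", 80), ("Y", 100)]
summary = summarise_by_country(toy_films)
print(summary["X"], summary["Y"])
```

Reporting the number of films alongside each average makes it easier to judge how much weight a small group, such as a country with few films, can bear.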

My analysis has several limitations. Firstly, Netflix only provides films that are popular. Thus, there is a chance that the films on Netflix are not representative of all films. For example, it could be that long films are less popular, so Netflix has fewer long films on its platform. Another limitation is that the number of films from Egypt was relatively low, although I don't think it was low enough to invalidate my results.

# Conclusion and Reflection

## What have I done?  
### My Code
I have shown that I possess an excellent understanding of all the course materials. I have appropriately and effectively utilised all the required tools. My code fulfils all the criteria:  
Programming Skills: I have appropriately used an array of programming techniques including modularisation, functions, classes, pandas and others.  
Robustness: In the case of errors, a useful error is raised / message is provided to the user to highlight possible issues.  
Readability and Style: PEP 8 Style has been used throughout.  
Program Design: The questions have been decomposed into simple steps. An example of this is the cleaner function which takes text and cleans it. Being a function, I have been able to use it at different times when it was needed. The code was only written once (in the function) so it keeps it tidy and is easy to update.  
Testing: Throughout development my code has been tested to ensure it works; where requested, I have provided tests.  
Self-Learning: I have used libraries not mentioned in this course. These include dateutil.relativedelta, collections.OrderedDict and matplotlib.pyplot.  
Analysis: I perform a sensible and informative analysis. Reflections are provided below.  
Communication: My report is clear and concise.


## What are the limitations?
Python possesses extensive libraries. While I have made an effort to ensure my code is as fast and efficient as possible, I'm sure there are countless ways to improve upon it.

Code in general is also very rigid. It requires perfect inputs; otherwise it won't run.

## What can be improved?
For question three, my analysis is very simple. This is partly down to my inexperience with natural language processing: I'm unfamiliar with the common and key techniques for analysing text. If I had more time, I could have provided a more in-depth analysis, looking at other variables and possibly running a regression to see which factors have an effect on film length.  
Also, as previously mentioned, my code could probably be more efficient. For the large loop loading in the Netflix data to the streaming service, I could have used multiprocessing to speed it up.  
If I had more time, I could produce more features for question 2, making it more realistic. I could also update the search function to treat numbers properly regardless of how you write them, i.e., 'one' as 1. 
