# IST256 Project Deliverable 3 (P3)

## Phase 3: Data Story / Coding for Explanation

In this step, you submit the final version of your working code. You should be implementing the data story that you discussed in P2 (2.3.1). 

All code necessary to make the project run should be included in this notebook. This includes all imports, functions, setup code, and your `interact` code. There should be no exploratory code and no code that causes errors here.

The expectation is that your instructor can open this notebook, run all cells, and then use your program.

The code you write should be clear and easy to understand, and it should use the affordances learned in the course.

No changes to your code will be considered after this submission. It is important to take prior instructor feedback into consideration, as it factors into your evaluation.


### Step 1: Summarize Enhancements and Changes

If there were any enhancements or changes to your P3 from your P2 (including those you suggested), please explain them here. For example, you might have geocoded your dataset or extracted entities from the text.


From my P2 to my P3, I figured out how I was going to present my data story in my tool. Based on the information from P2, I decided to create a histogram (the plot_histogram function) that previews how many movies and TV shows fit the user's choices before the user scrolls down to the actual list. I also noticed that it would be interesting to let the user narrow the output by more than just genre, so I added inputs for lowest rating and film type. To complete my data story, I thought it was important to use a cartopy map (the plot_flight_path function) to display the flight path. After learning cartopy for P2, I knew it was something that could add to my final project.
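
To illustrate how the genre, lowest rating, and film type inputs narrow the results, here is a minimal sketch in plain Python. The sample rows and the `matches` helper are hypothetical and stand in for the pandas masks used in the project code; the real data comes from n_movies.csv.

```python
# Hypothetical sample rows; the real project reads these from n_movies.csv
titles = [
    {"title": "Sample Movie A", "genre": "Action, Comedy", "rating": 7.2, "TV": "no"},
    {"title": "Sample Show B", "genre": "Drama", "rating": 8.1, "TV": "yes"},
    {"title": "Sample Movie C", "genre": "Comedy", "rating": 5.4, "TV": "no"},
]

def matches(row, genre, lowest_rating, film_type):
    """Return True when a row satisfies the genre, rating, and type inputs."""
    genre_ok = genre == "*ANY*" or genre in row["genre"]
    rating_ok = row["rating"] >= lowest_rating
    if film_type == "Movies":
        type_ok = row["TV"] == "no"
    elif film_type == "TV":
        type_ok = row["TV"] == "yes"
    else:  # "All" keeps both movies and TV shows
        type_ok = True
    return genre_ok and rating_ok and type_ok

# Comedy movies rated 6 or higher
comedy_movies = [r["title"] for r in titles if matches(r, "Comedy", 6, "Movies")]
print(comedy_movies)
```

Each input adds one more condition, so a title appears only when it passes all three.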

Here is my journal starting at when I started P3:

4/22
- started p3
- engineered my netflix columns
    - made the duration column numerical
    - added a tv shows column
    - added a rounded rating column
- added the haversine formula and estimate_flight_durations formula to my final code
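
As a sanity check on the distance and duration estimate, here is a minimal, self-contained sketch of the haversine calculation. It assumes a spherical Earth with a 6371 km radius and a constant speed of 800 km/h, the same values used in the project code. The function names and the two test points (on the equator, 90 degrees of longitude apart) are illustrative and not taken from airports.csv.

```python
import math

EARTH_RADIUS_KM = 6371  # assumed mean Earth radius

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two (lon, lat) points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_KM * c

def flight_minutes(distance_km, average_speed_kmh=800):
    """Flight time in minutes at a constant speed (ignores takeoff, landing, and wind)."""
    return distance_km / average_speed_kmh * 60

# Two points on the equator, a quarter of the way around the globe from each other
quarter_circle_km = haversine_km(0, 0, 90, 0)
print(round(quarter_circle_km))                  # about a quarter of Earth's circumference
print(round(flight_minutes(quarter_circle_km)))  # estimated minutes at 800 km/h
```

Because the two points are a quarter circle apart, the result should equal the radius times pi/2, which is an easy way to confirm the formula is wired up correctly.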

4/23
- learned matplotlib
- made a cartopy map that can show the flight path
- made the plot_flight_path function

4/24
- began the plot_histogram function
    - only for inputs of genre and flight duration for now
- learned how to add a vertical line on the histogram to show my flight duration
- created the get_movies function
    - only for inputs of genre and flight duration for now
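
The vertical line on the histogram marks the flight duration, and get_movies keeps only titles whose duration falls close to it. A minimal sketch of that window check in plain Python follows. The `within_window` helper and the sample durations are hypothetical; the 30 minute tolerance matches the default in the project code.

```python
def within_window(duration, flight_duration, tolerance=30):
    """True when a duration is positive and within +/- tolerance minutes of the flight."""
    return duration > 0 and flight_duration - tolerance <= duration <= flight_duration + tolerance

# Hypothetical durations in minutes (0 stands in for a missing value)
durations = [0, 45, 95, 120, 148, 200]
flight_duration = 125.0

# Keep only the durations that fall inside the 95 to 155 minute window
close = [d for d in durations if within_window(d, flight_duration)]
print(close)
```

The `duration > 0` check matters because missing durations are filled with 0 during cleaning, and those rows should never count as a match for a short flight.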
    
4/26
- started my onclick function using interact manual
- called my functions to show the flight path and histograms
- called get_movies function to find movies based on input (genre and flight duration)

4/28
- added inputs of film type and lowest rating
    - changed my functions
- added nicer looking widgets
- used html to display nice headings rather than just prints

5/1
- noticed an issue with the tv column (some films listed as tv are movies that were made for television broadcasting)
- added a warning when tv or all is selected
- finalized p3
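
The TV column issue noted on 5/1 comes from how the flag is built: any certificate that contains the text "TV" is marked as TV. A small sketch of that rule is below. The `tv_flag` helper is hypothetical and the certificate values are examples chosen for illustration, not values confirmed from the dataset.

```python
def tv_flag(certificate):
    """Mirror of the project's rule: 'yes' when the certificate text contains 'TV'."""
    return "yes" if "TV" in (certificate or "") else "no"

# Hypothetical certificate values for illustration
print(tv_flag("TV-MA"))  # flagged as TV, whether it is a series or a made-for-TV movie
print(tv_flag("PG-13"))  # not flagged
print(tv_flag(None))     # missing certificates are treated as not TV
```

Since the rule looks only at the certificate text, it cannot tell a series from a movie made for television, which is why the tool prints a warning when TV or All is selected.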

### Step 2: Project Code

Include all project code below. This includes code that enhances the original dataset. Make sure to execute your code to ensure it runs properly before you turn it in. 

Add as many cells as you need here.


In [3]:
!pip install cartopy 
#installs cartopy; kept in its own cell because the install output is messy



In [3]:
import math
import warnings
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt #one thing I learned on my own
import cartopy.crs as ccrs #one thing I learned on my own
import ipywidgets as widgets
from IPython.display import display, HTML
from ipywidgets import interact_manual

warnings.filterwarnings('ignore')
%matplotlib inline

airportdf = pd.read_csv("airports.csv")
netflixdf = pd.read_csv("n_movies.csv")

#engineer new column to indicate TV shows vs movies
netflixdf['TV'] = netflixdf['certificate'].fillna('').apply(lambda row: 'yes' if 'TV' in row else 'no')

#engineer new column to show the rounded rating of the show or movie
netflixdf['rounded_rating'] = netflixdf['rating'].round()

#clean duration column of the dataset
netflixdf['duration'] = netflixdf['duration'].fillna(0)
netflixdf['duration'] = netflixdf['duration'].astype(str).str.replace('min', '')
netflixdf['duration'] = pd.to_numeric(netflixdf['duration'], errors='coerce') #https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html (this converts items in the column to numerical values)

def haversine_distance(lon1, lat1, lon2, lat2):
    """
    calculating the great-circle distance between two points
    on the earth's surface using the haversine formula
    """
    #conversion from degrees to radians (help from chatgpt)
    lon1, lat1, lon2, lat2 = map(math.radians, [lon1, lat1, lon2, lat2])

    #haversine formula to find distance based on lon and lat (https://gist.github.com/rochacbruno/2883505?permalink_comment_id=2615334)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    radius = 6371  #radius of the Earth in kilometers
    distance = radius * c

    return distance  #distance in kilometers

def estimate_flight_duration(departure_airport, arrival_airport, average_speed=800):
    """
    estimating the duration of a flight based on the great-circle distance
    between the starting and ending airports and an average flight speed
    """
    start_lon = airportdf.loc[airportdf['iata'] == departure_airport, 'long'].values[0]
    start_lat = airportdf.loc[airportdf['iata'] == departure_airport, 'lat'].values[0]
    end_lon = airportdf.loc[airportdf['iata'] == arrival_airport, 'long'].values[0]
    end_lat = airportdf.loc[airportdf['iata'] == arrival_airport, 'lat'].values[0]

    #great-circle distance between the two airports
    distance_km = haversine_distance(start_lon, start_lat, end_lon, end_lat)

    #duration of the flight (with constant speed)
    duration_mins = distance_km / average_speed * 60

    return duration_mins

def plot_flight_path(departure_airport, arrival_airport):
    #get coordinates of departure and arrival airports
    departure_lon = airportdf.loc[airportdf['iata'] == departure_airport, 'long'].values[0]
    departure_lat = airportdf.loc[airportdf['iata'] == departure_airport, 'lat'].values[0]
    arrival_lon = airportdf.loc[airportdf['iata'] == arrival_airport, 'long'].values[0]
    arrival_lat = airportdf.loc[airportdf['iata'] == arrival_airport, 'lat'].values[0]

    #create a cartopy map (got help from ChatGPT)
    fig = plt.figure(figsize=(10, 6)) 
    ax = fig.add_subplot(1, 1, 1, projection=ccrs.PlateCarree()) #https://scitools.org.uk/cartopy/docs/latest/reference/projections.html (this line of code sets the plot size and projection)

    #plot departure and arrival airports
    ax.plot(departure_lon, departure_lat, 'ro', markersize=8, label='Departure Airport') #'ro' is the color and shape of the plot mark
    ax.plot(arrival_lon, arrival_lat, 'bo', markersize=8, label='Arrival Airport')

    #plot the path between departure and arrival airports and set coastlines
    ax.plot([departure_lon, arrival_lon], [departure_lat, arrival_lat], color='black', linestyle='--', transform=ccrs.Geodetic()) #help from chatgpt to learn linestyle and transform (to set the type of line and the projection of the line)
    ax.coastlines()

    #set map extent and gridlines
    ax.set_extent([-180, 180, -90, 90], crs=ccrs.PlateCarree())
    ax.gridlines(draw_labels=True)

    #add legend
    ax.legend()

    #add title
    ax.set_title(f'Flight Path from {airportdf.loc[airportdf["iata"] == departure_airport, "airport"].values[0]} to {airportdf.loc[airportdf["iata"] == arrival_airport, "airport"].values[0]}')

    plt.show()

def plot_histogram(genre, flight_duration, lowest_rating, tv=None):
    #consider tv and rating selections
    if tv == 'Movies':
        mask_tv = netflixdf['TV'] == 'no'
    elif tv == 'TV':
        mask_tv = netflixdf['TV'] == 'yes'
    else:
        mask_tv = netflixdf['TV'].notna()  #include both tv shows and movies

    mask_rating = netflixdf['rating'] >= lowest_rating
    mask = mask_tv & mask_rating

    #narrow to a specific genre unless '*ANY*' is selected
    if genre == '*ANY*':
        genre_label = 'All Genres'
    else:
        mask = mask & netflixdf['genre'].fillna('').str.contains(genre)
        genre_label = f'Genre: {genre}'

    durations = netflixdf[mask]['duration']

    plt.figure(figsize=(10, 6)) #learned matplotlib code to set figure size, labels, and title from https://matplotlib.org/stable/api/matplotlib_configuration_api.html
    plt.xlabel('Duration (Minutes)')
    plt.ylabel('Frequency')
    plt.title(f'Histogram of Film Durations for {genre_label} with Rating >= {lowest_rating}')
    sns.histplot(durations.dropna(), bins=40) #it was easier for me to visualize all the changes to the histogram by using matplotlib rather than doing it all on one line

    #add flight duration annotation (a vertical line on the histogram)
    plt.axvline(x=flight_duration, color='red', linestyle='--', label=f'Flight Duration: {flight_duration:.2f} minutes') #https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.axvline.html
    plt.legend()
    plt.show()

def get_movies(genre, lowest_rating, flight_duration, tolerance=30):
    if genre == '*ANY*':
        mask_genre = netflixdf['duration'].notna()  #filter out null durations
    else:
        mask_genre = netflixdf['genre'].fillna('').str.contains(genre)
    mask_rating = netflixdf['rating'] >= lowest_rating
    mask_duration = (netflixdf['duration'] > 0) & (netflixdf['duration'] >= flight_duration - tolerance) & (netflixdf['duration'] <= flight_duration + tolerance)
    
    close_movies = netflixdf[mask_genre & mask_rating & mask_duration] 
    close_movies = close_movies.drop_duplicates(subset=['title']) #https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.DataFrame.drop_duplicates.html (drops duplicate titles in the df)
    return close_movies[['title', 'duration', 'genre', 'rating', 'TV', 'description']]
        
#calculate unique genres
unique_genres = set()
for genre in netflixdf['genre']:
    if isinstance(genre, str): #got some extra help to relearn isinstance from https://www.w3schools.com/python/ref_func_isinstance.asp
        unique_genres.update(genre.replace(",", "").split()) #.update is used to update a set (unique_genres is a set)

#add "any" option to unique genres
unique_genres_with_any = ['*ANY*'] + list(sorted(unique_genres))

#calculate mean durations for each genre
genre_mean_duration = {}
for genre in unique_genres:
    mask_genre = netflixdf['genre'].fillna('').str.contains(genre)
    durations = netflixdf.loc[mask_genre, 'duration']
    mean_duration = durations.mean()
    genre_mean_duration[genre] = mean_duration

#turn genre_mean_duration into a df
genre_mean_duration_df = pd.DataFrame(genre_mean_duration.items(), columns=['genre', 'mean duration'])
genre_mean_duration_df = genre_mean_duration_df.sort_values(by='mean duration')

#make nice looking widgets with descriptions that are fully visible (https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20List.html)
departure_widget = widgets.Dropdown(options=airportdf['iata'], description='Departure Airport', style={'description_width': 'initial'}) #got help from ChatGPT to use 'description_width' (used to make the entire description visible)
arrival_widget = widgets.Dropdown(options=airportdf['iata'], description='Arrival Airport', style={'description_width': 'initial'})
genre_widget = widgets.Dropdown(options=unique_genres_with_any, description='Genre', style={'description_width': 'initial'})
tv_widget = widgets.Dropdown(options=['All', 'Movies', 'TV'], description='Type', style={'description_width': 'initial'})
lowest_rating_widget = widgets.IntSlider(min=1, max=10, description='Lowest Rating', style={'description_width': 'initial'})

display(HTML('<h1>Films for Flights</h1>'))
@interact_manual(departure_airport=departure_widget, arrival_airport=arrival_widget, genre=genre_widget, tv=tv_widget, lowest_rating=lowest_rating_widget)
def onclick(departure_airport, arrival_airport, genre, tv, lowest_rating):
    if departure_airport == arrival_airport:
        display(HTML("<h3>Departure and arrival airports cannot be the same.</h3>"))
        return

    duration_mins = estimate_flight_duration(departure_airport, arrival_airport)
    display(HTML(f"<h3>The duration of the flight from {airportdf.loc[airportdf['iata'] == departure_airport, 'airport'].values[0]} to {airportdf.loc[airportdf['iata'] == arrival_airport, 'airport'].values[0]} is approximately {duration_mins:.2f} minutes.</h3>"))
    plot_flight_path(departure_airport, arrival_airport)
    plot_histogram(genre, duration_mins, lowest_rating, tv)  #pass tv parameter here
    
    #get close movies, then narrow by film type
    close_movies = get_movies(genre, lowest_rating, duration_mins)
    if tv == 'Movies':
        close_movies = close_movies[close_movies['TV'] == 'no']
    elif tv == 'TV': #some of these are movies that were made for television broadcasting
        close_movies = close_movies[close_movies['TV'] == 'yes']

    #warn whenever TV titles can appear in the results
    if tv in ('All', 'TV'):
        print('**Some films listed as "TV" are movies that were made for television broadcasting**')
    
    #display list of close movies
    if not close_movies.empty:
        display(HTML('<h3>\nClose Movies:</h3>'))
        display(HTML(close_movies.sample(n=min(len(close_movies), 15)).to_html())) #learned from ChatGPT (randomly selects up to 15 rows from the df and displays them as an html-formatted table)
    else:
        display(HTML("<h3>\nNo movies found that match the criteria and duration range.</h3>"))

interactive(children=(Dropdown(description='Departure Airport', options=('00M', '00R', '00V', '01G', '01J', '0…

### Prepare for your Pitch and Reflection (P4)

With the project code complete, it's time to prepare for the final deliverable: submitting your project demo pitch and reflection.


In [None]:
# run this code to turn in your work!
from casstools.assignment import Assignment
Assignment().submit()

✅ TIMESTAMP  : 2024-05-04 17:45
✅ COURSE     : ist256
✅ TERM       : spring2024
✅ USER       : ahschiff@syr.edu
✅ STUDENT    : True
✅ PATH       : ist256/spring2024/lessons/project/P3.ipynb
✅ ASSIGNMENT : P3.ipynb
✅ POINTS     : 0
✅ DUE DATE   : 2024-05-07 23:59
✅ LATE       : False
✅ STATUS     : New Submission



❓ Submit? [y/n] ❓  y
