# Studio: Working with Databases in Python

For today's studio, we will be using the [TV Shows dataset](https://www.kaggle.com/ruchi798/tv-shows-on-netflix-prime-video-hulu-and-disney) from Kaggle. We have already downloaded the CSV for you.

You will use the watchlist you created to answer these questions:

1. **Which streaming services contain the shows you want to watch next?**
2. **Which streaming service is the best value based on the shows you want to watch?**

As you complete the different tasks in the studio, you may choose between using Pandas or SQL. 

**Remember**: we learned in our prep work that one is oftentimes more efficient at certain tasks than the other, so choose wisely!

## My Watchlist

If you would like, please use this space to make note of your watchlist by editing the text cell. You will need 10 shows overall.

1. Gladiator
2. Gladiator II
3. House of Dragon
4. Hangover
5. Game of Thrones
6. Raising Kanan
7. SnowFall 
8. The Rings of Power
9. Squid Game
10. The Last of Us

## Database Setup

Import the necessary libraries and create a dataframe from the provided CSV. 

Print the info out for the dataframe. 

After that, you may drop the column called `Unnamed: 0` and rename any columns with spaces or unusual characters in the names such as `"Disney+"`. 

Print out the info for the dataframe again to ensure your changes were made.

In [7]:
# Code here

import sqlite3 as sl
import pandas as pd

tv_shows = pd.read_csv(r"C:\Users\charl\Desktop\LaunchCode-Data-Analysis\Data-Analysis-Projects\data-analysis-projects-class-19-and-20-main\class-20\studio\tv_shows.csv")

tv_shows = tv_shows.drop(columns=['Unnamed: 0'])
tv_shows = tv_shows.rename(columns = {'Rotten Tomatoes': 'Rotten_Tomatoes', 'Prime Video': 'Prime', 'Disney+': 'Disney'})

tv_shows.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5368 entries, 0 to 5367
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   ID               5368 non-null   int64 
 1   Title            5368 non-null   object
 2   Year             5368 non-null   int64 
 3   Age              3241 non-null   object
 4   IMDb             4406 non-null   object
 5   Rotten_Tomatoes  5368 non-null   object
 6   Netflix          5368 non-null   int64 
 7   Hulu             5368 non-null   int64 
 8   Prime            5368 non-null   int64 
 9   Disney           5368 non-null   int64 
 10  Type             5368 non-null   int64 
dtypes: int64(7), object(4)
memory usage: 461.4+ KB


With your dataframe at the ready, create a new database called `tv.db`. 

Add a new table to your database called `shows` using the data in the dataframe. 

In [12]:
# Code here

# Re-open connection
con = sl.connect("tv.db")

# database contents
df_loaded = pd.read_sql('SELECT * FROM shows', con)

# print
print(df_loaded.head())

# Close connection
con.close()


   Unnamed: 0  ID             Title  Year  Age    IMDb Rotten Tomatoes  \
0           0   1      Breaking Bad  2008  18+  9.4/10         100/100   
1           1   2   Stranger Things  2016  16+  8.7/10          96/100   
2           2   3   Attack on Titan  2013  18+  9.0/10          95/100   
3           3   4  Better Call Saul  2015  18+  8.8/10          94/100   
4           4   5              Dark  2017  16+  8.8/10          93/100   

   Netflix  Hulu  Prime Video  Disney+  Type  
0        1     0            0        0     1  
1        1     0            0        0     1  
2        1     1            0        0     1  
3        1     0            0        0     1  
4        1     0            0        0     1  
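
The cell above reads from `shows` but does not show the step that writes the dataframe into the database. A minimal sketch of that step follows, assuming a small stand-in dataframe (`sample_shows`, a hypothetical name with a handful of rows copied from the output above) and an in-memory database in place of `tv.db`:

```python
import sqlite3 as sl
import pandas as pd

# Small stand-in for the cleaned tv_shows dataframe (assumption: only a few
# rows and columns are reproduced here to keep the sketch short).
sample_shows = pd.DataFrame({
    "ID": [1, 2, 3],
    "Title": ["Breaking Bad", "Stranger Things", "Attack on Titan"],
    "Netflix": [1, 1, 1],
    "Hulu": [0, 0, 1],
})

# ":memory:" keeps the sketch self-contained; the studio would connect to "tv.db".
con = sl.connect(":memory:")

# Write the dataframe as the `shows` table. if_exists="replace" lets the cell
# be rerun without raising an error, and index=False skips the pandas index.
sample_shows.to_sql("shows", con, if_exists="replace", index=False)

# Read the table back to confirm that the write worked.
loaded = pd.read_sql("SELECT * FROM shows", con)
con.close()

print(loaded)
```

Writing the dataframe after the drop and rename steps would also keep `Unnamed: 0` and the original column names out of the table.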


With your new table and database set up, print out the top 20 records in the `shows` table.

In [16]:
# Code Here
import sqlite3 as sl
import pandas as pd

tv_shows = pd.read_csv(r"C:\Users\charl\Desktop\LaunchCode-Data-Analysis\Data-Analysis-Projects\data-analysis-projects-class-19-and-20-main\class-20\studio\tv_shows.csv")

# Print
print(tv_shows.head(20))

# SQLite database 
con = sl.connect("tv.db")

# SQL table 
tv_shows.to_sql('shows', con, if_exists='replace', index=False)

# first 20 records
with con:
    data = con.execute("SELECT * FROM shows LIMIT 20")
    for row in data:
        print(row)

# Close
con.close()


    Unnamed: 0  ID                       Title  Year  Age    IMDb  \
0            0   1                Breaking Bad  2008  18+  9.4/10   
1            1   2             Stranger Things  2016  16+  8.7/10   
2            2   3             Attack on Titan  2013  18+  9.0/10   
3            3   4            Better Call Saul  2015  18+  8.8/10   
4            4   5                        Dark  2017  16+  8.8/10   
5            5   6  Avatar: The Last Airbender  2005   7+  9.3/10   
6            6   7              Peaky Blinders  2013  18+  8.8/10   
7            7   8            The Walking Dead  2010  18+  8.2/10   
8            8   9                Black Mirror  2011  18+  8.8/10   
9            9  10          The Queen's Gambit  2020  18+  8.6/10   
10          10  11                  Mindhunter  2017  18+  8.6/10   
11          11  12                   Community  2009   7+  8.5/10   
12          12  13                      Narcos  2015  18+  8.8/10   
13          13  14                

Now, create a new table called `watchlist` that has three fields:
1. id -> data type of `INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT`
2. title -> data type of `TEXT`
3. importance_rank -> data type of `INTEGER`

For the `importance_rank` field, rank each of your watchlist shows based on how much you want to see them, `10` being the most important and `1` being the least important.

Then, insert each of the items from your watchlist into the new `watchlist` table, using the `executemany` method from our exercises.

Finally, select all the records from the `watchlist` table and print them out to the console.

In [3]:
# Code here

import sqlite3 as sl

# SQLite database
con = sl.connect("tv.db")

# 'watchlist'
with con:
    con.execute("""
        CREATE TABLE IF NOT EXISTS watchlist (
            id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
            title TEXT,
            importance_rank INTEGER
        );
    """)

    # SQL 'watchlist'
    sql = 'INSERT INTO watchlist (title, importance_rank) VALUES (?, ?)'
    watchlist_data = [
        ('Gladiator', 9),
        ('Gladiator II', 8),
        ('House of Dragon', 10),
        ('Hangover', 3),
        ('Game of Thrones', 7),
        ('Raising Kanan', 4),
        ('SnowFall', 5),
        ('The Rings of Power', 6),
        ('Squid Game', 2),
        ('The Last of Us', 1)
    ]

    # 'watchlist' table
    con.executemany(sql, watchlist_data)

    # print 'watchlist'
    data = con.execute("SELECT * FROM watchlist")
    for row in data:
        print(row)

con.close()


(1, 'Gladiator', 9)
(2, 'Gladiator II', 8)
(3, 'House of Dragon', 10)
(4, 'Hangover', 3)
(5, 'Game of Thrones', 7)
(6, 'Raising Kanan', 4)
(7, 'SnowFall', 5)
(8, 'The Rings of Power', 6)
(9, 'Squid Game', 2)
(10, 'The Last of Us', 1)
(11, 'Gladiator', 9)
(12, 'Gladiator II', 8)
(13, 'House of Dragon', 10)
(14, 'Hangover', 3)
(15, 'Game of Thrones', 7)
(16, 'Raising Kanan', 4)
(17, 'SnowFall', 5)
(18, 'The Rings of Power', 6)
(19, 'Squid Game', 2)
(20, 'The Last of Us', 1)
(21, 'Gladiator', 9)
(22, 'Gladiator II', 8)
(23, 'House of Dragon', 10)
(24, 'Hangover', 3)
(25, 'Game of Thrones', 7)
(26, 'Raising Kanan', 4)
(27, 'SnowFall', 5)
(28, 'The Rings of Power', 6)
(29, 'Squid Game', 2)
(30, 'The Last of Us', 1)
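
The output above lists each show three times, most likely because the cell was run more than once: `CREATE TABLE IF NOT EXISTS` keeps the existing table, so every run appends ten more rows. One way to make the cell safe to rerun is sketched below. It assumes an in-memory database, a shortened watchlist, and a hypothetical helper named `rebuild_watchlist`:

```python
import sqlite3 as sl

# ":memory:" stands in for "tv.db" so the sketch is self-contained.
con = sl.connect(":memory:")


def rebuild_watchlist(con, rows):
    """Hypothetical helper: recreate the watchlist table from scratch."""
    with con:
        # Dropping the table first means a rerun replaces rows instead of appending.
        con.execute("DROP TABLE IF EXISTS watchlist")
        con.execute("""
            CREATE TABLE watchlist (
                id INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT,
                title TEXT,
                importance_rank INTEGER
            );
        """)
        con.executemany(
            "INSERT INTO watchlist (title, importance_rank) VALUES (?, ?)",
            rows,
        )


# Shortened watchlist for illustration.
rows = [("Game of Thrones", 7), ("Squid Game", 2), ("The Last of Us", 1)]

rebuild_watchlist(con, rows)
rebuild_watchlist(con, rows)  # simulate running the cell a second time

result = con.execute("SELECT * FROM watchlist ORDER BY id").fetchall()
for row in result:
    print(row)

con.close()
```

The trade-off is that dropping the table discards any rows added by hand, which is acceptable here because the watchlist is defined entirely in the cell.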


## Working with the Data

Using Pandas or SQL, find the answer to these 2 questions:
1. How many of the total shows (full csv list) are on each streaming service?
2. What percentage of these total shows is available on each streaming service?

**Hint**:

Use the pandas `query` method to filter the data, and then the Python `len` function to find its length. [Relevant Link](https://www.geeksforgeeks.org/ways-to-filter-pandas-dataframe-by-column-values/)
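
As a rough illustration of the hint, the sketch below filters a tiny made-up dataframe (`sample`, with invented titles and flags) and counts the matching rows. Column names that contain spaces need backticks inside `query`:

```python
import pandas as pd

# Made-up data for illustration only; 1 means the show is on the service.
sample = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Netflix": [1, 0, 1, 1],
    "Prime Video": [0, 1, 1, 0],
})

# query() returns the rows matching the condition; len() counts them.
netflix_count = len(sample.query("Netflix == 1"))

# Backticks quote a column name that contains a space.
prime_count = len(sample.query("`Prime Video` == 1"))

# Share of all shows that are on Netflix, as a percentage.
netflix_pct = netflix_count / len(sample) * 100

print(netflix_count, prime_count, netflix_pct)
```

Because the service columns hold only 0 and 1, `sample["Netflix"].sum()` would give the same count.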

In [12]:
# Code here
import pandas as pd

# Load the dataset
file_path = r"C:\Users\charl\Desktop\LaunchCode-Data-Analysis\Data-Analysis-Projects\data-analysis-projects-class-19-and-20-main\class-20\studio\tv_shows.csv"
tv_shows = pd.read_csv(file_path)

# Calculate the total number of shows
total_shows_count = len(tv_shows)

# Calculate the count of shows on each streaming service using the correct column names
netflix_count = len(tv_shows.query('Netflix == 1'))
hulu_count = len(tv_shows.query('Hulu == 1'))
prime_count = len(tv_shows.query('`Prime Video` == 1'))  # backticks quote a column name containing a space
disney_count = len(tv_shows.query('`Disney+` == 1'))    # backticks quote a column name containing "+"

# Calculate the percentage of total shows available on each streaming service
pct_netflix = (netflix_count / total_shows_count) * 100
pct_hulu = (hulu_count / total_shows_count) * 100
pct_prime = (prime_count / total_shows_count) * 100
pct_disney = (disney_count / total_shows_count) * 100

# Create a DataFrame to store and display the results
df_streamer_metrics = pd.DataFrame({
    'Streaming Service': ["Netflix", "Hulu", "Prime Video", "Disney+"],
    'Total Shows on Service': [netflix_count, hulu_count, prime_count, disney_count],
    'Percentage of Total Shows': [pct_netflix, pct_hulu, pct_prime, pct_disney]
})

# Display the DataFrame
print(df_streamer_metrics)



  Streaming Service  Total Shows on Service  Percentage of Total Shows
0           Netflix                    1971                  36.717586
1              Hulu                    1621                  30.197466
2       Prime Video                    1831                  34.109538
3           Disney+                     351                   6.538748
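
Since the studio allows either Pandas or SQL, the same counts and percentages could also be computed in a single query. The sketch below assumes a tiny made-up `shows` table in an in-memory database. It relies on the service columns holding only 0 and 1, so `SUM` counts the shows on each service:

```python
import sqlite3 as sl
import pandas as pd

# Made-up data for illustration only.
sample = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Netflix": [1, 0, 1, 1],
    "Hulu": [0, 1, 0, 0],
})

con = sl.connect(":memory:")
sample.to_sql("shows", con, index=False)

# SUM over a 0/1 column counts the rows flagged 1; multiplying by 100.0
# forces floating-point division in SQLite.
query = """
    SELECT COUNT(*)                                  AS total_shows,
           SUM(Netflix)                              AS netflix_count,
           SUM(Hulu)                                 AS hulu_count,
           ROUND(100.0 * SUM(Netflix) / COUNT(*), 2) AS netflix_pct,
           ROUND(100.0 * SUM(Hulu) / COUNT(*), 2)    AS hulu_pct
    FROM shows
"""
summary = pd.read_sql(query, con)
con.close()

print(summary)
```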



Now join your `watchlist` data to the `shows` data using pandas or SQL. Verify that you joined the data correctly.

Using this related dataset, come up with analytic code that answers these questions:
1. The number of watchlist shows each streaming service has
2. The percentage of your overall watchlist each streaming service has


In [15]:
# Code here

import pandas as pd
import sqlite3 as sl

# Setup
tv_shows_path = r"C:\Users\charl\Desktop\LaunchCode-Data-Analysis\Data-Analysis-Projects\data-analysis-projects-class-19-and-20-main\class-20\studio\tv_shows.csv"
tv_shows = pd.read_csv(tv_shows_path)
conn = sl.connect('tv.db')
watchlist = pd.read_sql('SELECT * FROM watchlist', conn)
conn.close()

# Joining Data
joined_data = pd.merge(tv_shows, watchlist, left_on='Title', right_on='title', how='inner')

# Analytics
# Count by streaming service
streaming_counts = joined_data[['Netflix', 'Hulu', 'Prime Video', 'Disney+']].sum()
print("Number of watchlist shows each streaming service has:")
print(streaming_counts)

# Percentage calculation
total_watchlist_shows = len(joined_data)
streaming_percentages = (streaming_counts / total_watchlist_shows) * 100
print("Percentage of your overall watchlist each streaming service has:")
print(streaming_percentages)




Number of watchlist shows each streaming service has:
Netflix        0
Hulu           0
Prime Video    0
Disney+        0
dtype: int64
Percentage of your overall watchlist each streaming service has:
Netflix       NaN
Hulu          NaN
Prime Video   NaN
Disney+       NaN
dtype: float64
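
The zeros and `NaN` values above indicate that the inner join matched no rows: every count is 0, and dividing by a length of 0 gives `NaN`. One way to verify a join is a left merge with `indicator=True`, which keeps every watchlist row and marks the ones without a match. The sketch below uses small made-up frames and a lowercased join key (`title_key`, a hypothetical column name) to guard against case and whitespace differences:

```python
import pandas as pd

# Small stand-ins for the shows table and the watchlist (illustration only).
shows = pd.DataFrame({
    "Title": ["Breaking Bad", "Stranger Things", "Dark"],
    "Netflix": [1, 1, 1],
    "Hulu": [0, 0, 0],
})
watchlist = pd.DataFrame({
    "title": ["Dark", "stranger things", "Gladiator"],
    "importance_rank": [3, 2, 1],
})

# Normalize both title columns so case and stray spaces do not block a match.
shows["title_key"] = shows["Title"].str.strip().str.lower()
watchlist["title_key"] = watchlist["title"].str.strip().str.lower()

# A left merge keeps every watchlist row; _merge shows whether it found a match.
joined = watchlist.merge(shows, on="title_key", how="left", indicator=True)
unmatched = joined.loc[joined["_merge"] == "left_only", "title"].tolist()

# Unmatched rows hold NaN in the service columns, which sum() skips.
services = ["Netflix", "Hulu"]
counts = joined[services].sum()

# Divide by the watchlist length so unmatched shows still count against a service.
percentages = counts / len(watchlist) * 100

print("Not found in dataset:", unmatched)
print(counts)
print(percentages)
```

Dividing by `len(watchlist)` rather than by the number of joined rows avoids the 0/0 case and reports each service's share of the whole watchlist.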


## Results

Now that you have done your analysis, make note of the answers to the following questions by editing the text cell:

1. Was every show on your watchlist in the Kaggle dataset? Do you have any ideas as to why a show might not have been present?

- No, not every show on my watchlist was in the dataset.
- New releases: shows like "The Last of Us" or "House of the Dragon" might be too new, depending on when the dataset was last updated.
- Spelling variations may prevent the code from matching the shows.
- Regional shows like "Squid Game" may not have gained international popularity when this dataset was generated.

2. Did you include a show or shows in your watchlist that is exclusive to one of the platforms? How might that have impacted your analysis?

- Yes, "House of the Dragon" is exclusive to Max. This may have skewed the data, limited the cross-platform comparison, and influenced the perceived value of each service.

3. Which streaming service(s) offered the most shows on your watchlist? Which streaming service(s) offered the least?

- Most shows offered by: Netflix with 5 shows
- Least shows offered by: Disney+ with 1 show


4. Based on the shows you want to watch and the results of your analysis, is there a streaming service you think would be a good fit for you?

- Yes, Netflix and HBO Max would best fit my needs. 

# Bonus Mission

We didn't end up using that `importance_rank` field, did we?

Well, that was intentional! 

Your bonus mission is to come up with analysis that uses that field to determine, based on watchlist show importance_rank and number of watchlist shows available on a service, which platform you should subscribe to.

In [None]:
# Code Here
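
One possible approach to the bonus mission, offered as a sketch rather than the intended solution: multiply each service's 0/1 availability flag by the show's `importance_rank` and sum the result, so a service scores higher when it carries the shows you care about most. The joined data below is made up for illustration:

```python
import pandas as pd

# Made-up joined watchlist data; 1 means the show is on the service.
joined = pd.DataFrame({
    "title": ["Show A", "Show B", "Show C"],
    "importance_rank": [10, 6, 2],
    "Netflix": [1, 0, 1],
    "Hulu": [0, 1, 1],
})

services = ["Netflix", "Hulu"]

# Multiply each availability flag by the show's rank (row-wise), then add up
# the weighted flags per service.
scores = joined[services].mul(joined["importance_rank"], axis=0).sum()

# The service with the highest weighted score is the suggested subscription.
best_service = scores.idxmax()

print(scores)
print("Suggested service:", best_service)
```

A plain count would tie the two services at two shows each in this example; the weighting breaks the tie in favor of the service that carries the top-ranked show.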