
# CM4125 Topic 7 Lab 1, Joins and Aggregations (Solved)

The `rating.csv` dataset contains user ratings of anime

The data set is structured as follows:
    
| user_id | anime_id | rating |
| ------: | -------: | -----: |
|       1 |    11757 |     10 |
|       2 |    11771 |     10 |
|       3 |       20 |      8 |
|       3 |      154 |      6 |
|       3 |      170 |      9 |
|     ... |      ... |    ... |

Notice that no column is a unique index

In fact, `user_id` indicates the user doing the rating, `anime_id` indicates the anime, and `rating` is the user's rating of that anime from `1` to `10`

Import this dataset using the following cell, removing in advance all rows with missing data (you should have $5,484,219$ rows of data!)
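
As a minimal sketch of what "removing rows with missing data" can mean, the hypothetical toy frame below (not part of the lab data) compares filtering on a single column with `notnull` against dropping rows that have a gap in any column with `dropna`. The two agree here only because the `rating` column is the only one with a gap.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second row has no rating (NaN).
toy = pd.DataFrame({
    'user_id': [1, 2, 3],
    'anime_id': [20, 154, 170],
    'rating': [10.0, np.nan, 8.0],
})

# Keep rows whose 'rating' value is present.
by_column = toy[toy['rating'].notnull()]

# Drop rows with a missing value in any column.
any_column = toy.dropna()

print(len(by_column), len(any_column))
```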

In [None]:
import pandas as pd
rating = pd.read_csv(
    'https://www.dropbox.com/scl/fi/2lkp357dbt3oliizj8r2y/rating.csv?rlkey=ek73f60revrr8ik6mw9r8v62l&raw=1')
rating = rating[rating['rating'].notnull()]
rating

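|         | user_id | anime_id | rating |
| ------: | ------: | -------: | -----: |
|       0 |       1 |    11757 |     10 |
|       1 |       2 |    11771 |     10 |
|       2 |       3 |       20 |      8 |
|       3 |       3 |      154 |      6 |
|       4 |       3 |      170 |      9 |
|     ... |     ... |      ... |    ... |
| 5484214 |   73515 |    14345 |      7 |
| 5484215 |   73515 |    16512 |      7 |
| 5484216 |   73515 |    17187 |      9 |
| 5484217 |   73515 |    22145 |     10 |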


Let's reorder the dataset as follows:

1. Group the data by `user_id`
2. Aggregate using the `count` operation
3. Sort by count of rating (descending)
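
The three steps above can be sketched on a hypothetical toy frame (the values are made up for illustration). This variant selects the `rating` column before counting, so it returns a Series rather than a DataFrame.

```python
import pandas as pd

# Hypothetical toy ratings: user 3 rated three anime, user 2 two, user 1 one.
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 3, 3, 2],
    'anime_id': [11757, 11771, 20, 154, 170, 20],
    'rating': [10, 10, 8, 6, 9, 7],
})

counts = (
    toy.groupby('user_id')['rating']   # 1. group by user
       .count()                        # 2. count ratings per user
       .sort_values(ascending=False)   # 3. most active users first
)
print(counts)
```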

In [None]:
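# Group the ratings by user, then count the entries in each group
# (a list, not a set, is passed to agg so the column order is well defined)
user_group = rating.groupby('user_id')
user_agg = user_group.agg(['count'])
# Keep only the counts for the `rating` column
user_ratings = user_agg['rating']
# Most active users first
user_ratings = user_ratings.sort_values('count', ascending=False)
user_ratings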

| user_id | count |
| ------: | ----: |
|   42635 |  3243 |
|   51693 |  2507 |
|   57620 |  2422 |
|   59643 |  2314 |
|    7345 |  2192 |
|     ... |   ... |
|   10868 |     1 |
|   10851 |     1 |
|   51824 |     1 |
|   51830 |     1 |


If you did all of these steps correctly, you should now have a dataset that tells you how many anime each of the $69,180$ users has rated

Moreover, your `user_id` column should now be the index!

**Q. Which user (by ID) has rated the most anime?**

Answer: User 42635, with $3,243$ ratings
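
Because `user_id` is the index, one way to read off the answer without sorting is `idxmax`, which returns the index label of the largest value. The sketch below uses a hypothetical `toy_counts` frame as a stand-in for the per-user counts.

```python
import pandas as pd

# Hypothetical per-user counts, indexed by user_id.
toy_counts = pd.DataFrame(
    {'count': [1, 3, 2]},
    index=pd.Index([10, 20, 30], name='user_id'),
)

top_user = toy_counts['count'].idxmax()   # index label of the largest count
top_count = toy_counts['count'].max()     # the largest count itself
print(top_user, top_count)
```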

Now you will do the following:

1. Create a new dataframe called `anime_ratings` which shows the number of times each anime was rated and its mean rating
2. Rename column `count` as `Number of Ratings` and `mean` to `Average Rating`
3. Drop any anime which received fewer than $100$ ratings and sort by `Number of Ratings` (descending)
4. Round the average ratings to 2 decimal places
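
As an alternative sketch of these four steps, pandas named aggregation can compute and name both columns in one call, which makes the separate rename unnecessary. The toy data and the `min_ratings` threshold of 2 are assumptions for illustration; the lab itself uses a threshold of $100$.

```python
import pandas as pd

# Hypothetical toy ratings for three anime.
toy = pd.DataFrame({
    'anime_id': [20, 20, 20, 154, 154, 170],
    'rating': [8, 9, 10, 6, 7, 9],
})
min_ratings = 2  # hypothetical threshold (the lab uses 100)

# Steps 1 and 2: count and mean per anime, named directly.
summary = toy.groupby('anime_id').agg(**{
    'Number of Ratings': ('rating', 'count'),
    'Average Rating': ('rating', 'mean'),
})

# Step 3: drop rarely rated anime, then sort by popularity.
summary = summary[summary['Number of Ratings'] >= min_ratings]
summary = summary.sort_values('Number of Ratings', ascending=False)

# Step 4: round the averages to 2 decimal places.
summary['Average Rating'] = summary['Average Rating'].round(2)
print(summary)
```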

In [None]:
# Count and average the ratings for each anime
# (a list, not a set, is passed to agg so the column order is well defined)
anime_group = rating.groupby('anime_id')
anime_agg = anime_group.agg(['count', 'mean'])
anime_ratings = anime_agg['rating']
anime_ratings = anime_ratings.sort_values('count', ascending=False)
anime_ratings = anime_ratings.rename(columns={'count': 'Number of Ratings', 'mean': 'Average Rating'})
# Keep anime with at least 100 ratings; copy so the assignment below acts on an independent frame
anime_ratings = anime_ratings[anime_ratings['Number of Ratings'] >= 100].copy()
anime_ratings['Average Rating'] = anime_ratings['Average Rating'].round(2)
anime_ratings

| anime_id | Number of Ratings | Average Rating |
| -------: | ----------------: | -------------: |
|     1535 |             34226 |           8.83 |
|    11757 |             26310 |           8.14 |
|    16498 |             25290 |           8.73 |
|     1575 |             24126 |           8.93 |
|     6547 |             23565 |           8.55 |
|      ... |               ... |            ... |
|     1534 |               100 |           7.23 |
|    23831 |               100 |           6.62 |
|     7550 |               100 |           6.63 |
|      503 |               100 |           5.78 |


You should have a total of $3,513$ rated anime

**Q. Which anime (by ID) was rated most often? How often? What was its average rating?**

Answer: Anime 1535, with $34,226$ ratings and an average of 8.83 out of 10.


Now, we will import a second dataset called `anime.csv` to the variable `anime`.

In this one, the first column is the `anime_id` (which is unique)

You should have $10,228$ rows, each describing one anime

In [None]:
anime = pd.read_csv(
    'https://www.dropbox.com/scl/fi/yz1mw3t7ufge4i7then4e/anime.csv?rlkey=p5mzger0nc53uqdlf3qi6e3r2&raw=1',
    index_col=0)
anime

| anime_id | name | genre | type | episodes |
| -------: | :--- | :---- | :--- | -------: |
| 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 |
| 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 |
| 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 |
| 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 |
| 9969 | Gintama&#039; | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 |
| ... | ... | ... | ... | ... |
| 11095 | Zouressha ga Yatte Kita | Adventure | Movie | 1 |
| 7808 | Zukkoke Knight: Don De La Mancha | Adventure, Comedy, Historical, Romance | TV | 23 |
| 28543 | Zukkoke Sannin-gumi no Hi Asobi Boushi Daisakusen | Drama, Kids | OVA | 1 |
| 18967 | Zukkoke Sannin-gumi: Zukkoke Jikuu Bouken | Comedy, Historical, Sci-Fi | OVA | 1 |


We want to add the ratings from the `anime_ratings` dataframe onto the `anime` dataframe

Every rating in `anime_ratings` has a corresponding entry in `anime`

However, since we removed the less frequently rated anime from the `anime_ratings` dataframe, there are anime in the `anime` dataframe without corresponding entries in `anime_ratings`
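
Both claims about key coverage can be checked directly on the indexes. The sketch below uses two hypothetical toy frames, `toy_anime` and `toy_ratings`, indexed by `anime_id` in the same way as the lab's dataframes.

```python
import pandas as pd

# Hypothetical stand-ins for `anime` and `anime_ratings`.
toy_anime = pd.DataFrame(
    {'name': ['A', 'B', 'C']},
    index=pd.Index([20, 154, 170], name='anime_id'),
)
toy_ratings = pd.DataFrame(
    {'Number of Ratings': [3, 2]},
    index=pd.Index([20, 154], name='anime_id'),
)

# True if every rated anime also has a row in toy_anime.
all_rated_known = toy_ratings.index.isin(toy_anime.index).all()

# anime_ids that appear in toy_anime but have no ratings entry.
unrated = toy_anime.index.difference(toy_ratings.index)
print(all_rated_known, list(unrated))
```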

**Q. If we want to keep all of the information about anime (regardless of whether we have the rating or not), what type of join should we use, considering `anime` as the left and `anime_ratings` as the right dataset?**

Answer: Left join, since we want to keep everything from the left dataset

Note: Outer join will work too, since there are no right keys unmatched on the left

**Q. If we want to only keep anime which have ratings, what type of join should we use?**

Answer: Inner join, we want to keep the overlap

Note: Right join will work too since there are no right keys unmatched on the left
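
To make the two notes concrete, the sketch below joins two hypothetical toy frames with each join type and records the row counts. Every key on the right also appears on the left, as in the lab, so the left and outer joins return the same number of rows, and so do the inner and right joins.

```python
import pandas as pd

# Hypothetical data: `left` has four keys, `right` has a subset of two.
left = pd.DataFrame(
    {'name': ['A', 'B', 'C', 'D']},
    index=pd.Index([1, 2, 3, 4], name='anime_id'),
)
right = pd.DataFrame(
    {'Average Rating': [7.5, 8.1]},
    index=pd.Index([2, 4], name='anime_id'),
)

# Row count of the joined frame for each join type.
row_counts = {
    how: len(left.join(right, how=how))
    for how in ['left', 'inner', 'outer', 'right']
}
print(row_counts)
```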

Perform a join to keep only anime which have ratings

You should have 3,513 rows

In [None]:
anime = anime.join(anime_ratings, how='inner')
anime

| anime_id | name | genre | type | episodes | Number of Ratings | Average Rating |
| -------: | :--- | :---- | :--- | -------: | ----------------: | -------------: |
| 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 1961 | 9.43 |
| 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 21494 | 9.32 |
| 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 1188 | 9.45 |
| 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 17151 | 9.26 |
| 9969 | Gintama&#039; | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 3115 | 9.27 |
| ... | ... | ... | ... | ... | ... | ... |
| 19315 | Pupa | Fantasy, Horror, Psychological | TV | 12 | 2135 | 4.22 |
| 28929 | Vampire Holmes | Comedy, Mystery, Supernatural, Vampire | TV | 12 | 133 | 4.19 |
| 16608 | Shitcom | Comedy, Romance | ONA | 1 | 192 | 3.21 |
| 413 | Hametsu no Mars | Horror, Sci-Fi | OVA | 1 | 1024 | 2.47 |


Sort the data frame by `Average Rating` (descending)

**Q. Which anime was most highly rated?**

In [None]:
anime = anime.sort_values('Average Rating', ascending=False)
anime

| anime_id | name | genre | type | episodes | Number of Ratings | Average Rating |
| -------: | :--- | :---- | :--- | -------: | ----------------: | -------------: |
| 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 1188 | 9.45 |
| 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 1961 | 9.43 |
| 820 | Ginga Eiyuu Densetsu | Drama, Military, Sci-Fi, Space | OVA | 110 | 803 | 9.39 |
| 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 21494 | 9.32 |
| 9969 | Gintama&#039; | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 3115 | 9.27 |
| ... | ... | ... | ... | ... | ... | ... |
| 28929 | Vampire Holmes | Comedy, Mystery, Supernatural, Vampire | TV | 12 | 133 | 4.19 |
| 1657 | Byston Well Monogatari: Garzey no Tsubasa | Fantasy | OVA | 3 | 110 | 4.01 |
| 16608 | Shitcom | Comedy, Romance | ONA | 1 | 192 | 3.21 |
| 413 | Hametsu no Mars | Horror, Sci-Fi | OVA | 1 | 1024 | 2.47 |


Answer: Gintama° with a rating of 9.45 out of 10

**Q. Which anime had the most ratings?**

In [None]:
anime = anime.sort_values('Number of Ratings', ascending=False)
anime

| anime_id | name | genre | type | episodes | Number of Ratings | Average Rating |
| -------: | :--- | :---- | :--- | -------: | ----------------: | -------------: |
| 1535 | Death Note | Mystery, Police, Psychological, Supernatural, ... | TV | 37 | 34226 | 8.83 |
| 11757 | Sword Art Online | Action, Adventure, Fantasy, Game, Romance | TV | 25 | 26310 | 8.14 |
| 16498 | Shingeki no Kyojin | Action, Drama, Fantasy, Shounen, Super Power | TV | 25 | 25290 | 8.73 |
| 1575 | Code Geass: Hangyaku no Lelouch | Action, Mecha, Military, School, Sci-Fi, Super... | TV | 25 | 24126 | 8.93 |
| 6547 | Angel Beats! | Action, Comedy, Drama, School, Supernatural | TV | 13 | 23565 | 8.55 |
| ... | ... | ... | ... | ... | ... | ... |
| 1932 | Yes! Precure 5 | Action, Fantasy, Magic, School, Shoujo | TV | 49 | 100 | 7.38 |
| 1534 | Futari wa Precure: Splash☆Star | Action, Comedy, Fantasy, Magic, Shoujo | TV | 49 | 100 | 7.23 |
| 23831 | Mahou Shoujo Madoka★Magica Movie 3: Hangyaku n... | Comedy | Movie | 4 | 100 | 6.62 |
| 1081 | Space Pirate Captain Herlock: Outside Legend -... | Action, Adventure, Drama, Sci-Fi, Seinen, Space | OVA | 13 | 100 | 7.78 |


Answer: Death Note with $34,226$ ratings

The types of anime are `Movie`, `Music`, `ONA`, `OVA`, `Special` and `TV`

Use data aggregation to produce the following summary of the anime

| type    | Number of Anime | Median Number of Ratings per Anime |
| :------ | --------------: | ---------------------------------: |
| TV      |            1713 |                              921.0 |
| OVA     |             602 |                              349.5 |
| Movie   |             555 |                              719.0 |
| Special |             513 |                              350.0 |
| ONA     |              93 |                              439.0 |
| Music   |              37 |                              255.0 |

In [None]:
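# Group by type, then count the anime and take the median number of ratings
group = anime.groupby('type')
agg = group.agg({'Number of Ratings': ['count', 'median']})
# Flatten the two-level column labels into the names used in the target table
agg.columns = ['Number of Anime', 'Median Number of Ratings per Anime']
agg = agg.sort_values('Number of Anime', ascending=False)
print(agg)

         Number of Anime  Median Number of Ratings per Anime
type                                                        
TV                  1713                               921.0
OVA                  602                               349.5
Movie                555                               719.0
Special              513                               350.0
ONA                   93                               439.0
Music                 37                               255.0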
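
Passing a dictionary of lists to `agg`, as in the cell above, yields two-level column labels, which is why the columns are reassigned afterwards. A minimal sketch on a hypothetical toy frame shows the labels before and after flattening:

```python
import pandas as pd

# Hypothetical toy data: three TV anime and two OVAs.
toy = pd.DataFrame({
    'type': ['TV', 'TV', 'TV', 'OVA', 'OVA'],
    'Number of Ratings': [100, 300, 500, 200, 400],
})

summary = toy.groupby('type').agg({'Number of Ratings': ['count', 'median']})

# Each label is a (column, function) pair at this point.
two_level = summary.columns.tolist()
print(two_level)

# Replace the pairs with flat, descriptive names, then sort.
summary.columns = ['Number of Anime', 'Median Number of Ratings per Anime']
summary = summary.sort_values('Number of Anime', ascending=False)
print(summary)
```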
