# Data 301 Final Project

***

### Topic: Crunchyroll

### By: Anish Yakkala & Lemar Popal

### Github: https://github.com/ayakkala1/Data301/tree/master/crunchyroll

***

In [1]:
from IPython.display import HTML

## 1. Getting the Data

Our project involved scraping [Crunchyroll](https://www.crunchyroll.com/) for information on anime shows and their respective reviews.

In total, we scraped $1002$ shows and $73,392$ reviews!

<img src="images/Crunchy_Home.png">

__HOWEVER__ before we scrape we must check the `robots.txt`!

<img src="images/Crunchy_Robots.png">

It looks like we are all good to scrape what we need.
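The same check can also be done programmatically. Below is a minimal sketch using Python's standard `urllib.robotparser`; the rules in `EXAMPLE_ROBOTS_TXT` and the `example.com` URLs are made up for illustration and are not Crunchyroll's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def can_scrape(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

allowed = can_scrape(EXAMPLE_ROBOTS_TXT, "https://www.example.com/videos/anime")
blocked = can_scrape(EXAMPLE_ROBOTS_TXT, "https://www.example.com/private/page")
```

Against a live site, `RobotFileParser.set_url(...)` followed by `read()` would fetch the real file instead of parsing a string.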

To get every show's information and reviews, we need an easy way of getting all of their URLs. Crunchyroll makes this simple: its page listing all shows in [alphabetical order](https://www.crunchyroll.com/videos/anime/alpha?group=all) is easy to scrape.

<img src="images/Crunchy_Main_scrape.png">


We now have access to each show's home page and, from there, all of its review pages.
<img src="images/Crunchy_Desc.png"> <img src="images/Crunchy_Reviews.png">


We iterate through all the review pages for each show until we reach a page that says there are no reviews.
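The stopping rule can be sketched as a simple loop. This is only an outline of the idea, not our actual scraper: `fetch_page` is a hypothetical callable standing in for the HTTP request and HTML parsing, and `fake_site` is made-up data.

```python
from typing import Callable, List

def collect_reviews(fetch_page: Callable[[int], List[str]], max_pages: int = 1000) -> List[str]:
    """Request review pages 1, 2, 3, ... until a page comes back empty."""
    reviews: List[str] = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:  # an empty page signals that there are no more reviews
            break
        reviews.extend(batch)
    return reviews

# Stand-in for a real fetch: two pages of reviews, then nothing.
fake_site = {1: ["great show", "loved it"], 2: ["too slow"]}
all_reviews = collect_reviews(lambda page: fake_site.get(page, []))
```

The `max_pages` cap is a safety net so the loop still terminates if the "no reviews" page is never detected.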

After scraping all the data into a list, we pickle it in order to retain the data's integrity (such as having columns of lists). Now it is time to wrangle the data to make it ready for EDA.
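To illustrate why pickling helps here, the sketch below round-trips one made-up record through both `pickle` and CSV. The record and its field names are hypothetical.

```python
import csv
import io
import pickle

# Hypothetical scraped record with a list-valued field.
record = {"name": "Example Show", "genres": ["action", "comedy"]}

# Pickle round trip: the list comes back as a list.
restored = pickle.loads(pickle.dumps([record]))

# CSV round trip: the list is written as its string representation,
# so it comes back as a plain string that would need re-parsing.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "genres"])
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))
```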

You can see our scraper in action!

In [2]:
HTML('<blockquote class="imgur-embed-pub" lang="en" data-id="0VdXtFf"><a href="//imgur.com/0VdXtFf"></a></blockquote><script async src="//s.imgur.com/min/embed.js" charset="utf-8"></script>')

## 2. Wrangling the Data

How the raw data looks initially...

<img src="images/Messy_table.png">

We need to make these into tidy dataframes.

Through some work in our `Crunchyroll_wrangle` notebook, we turned our raw dataset into three workable datasets.

### Home Page Information

<img src="images/Home_tables.png">

### Review Page Information

<img src="images/Review_table.png">

### Main Page information where each review is its own observation

<img src="images/Main_tables.png">

## 3. Data Exploration

### Anime Genre Trends Since the 2000s

One of our first analyses looked at how genre frequency changed over time. To avoid a cluttered stacked bar chart, we only used the top $6$ genres.

<img src="images/Genre_time.png">

***

### We then looked at shows that are often said to be similar to another show.

<img src="images/Crunchy_Similar.png">

<img src="images/Top_similar.png">

***

### Use of English in Reviews Over Time

Crunchyroll has an international market, so not all reviews are in English. We used a library called `enchant` to encode whether each review was in English or not.

<img src="images/English_time.png">

It is quite interesting how much the use of English in reviews varies over time, along with its slight overall increase.
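One way such an English flag could be computed is sketched below. This is an assumption about the approach rather than our exact code: a review counts as English when at least half of its tokens pass a dictionary check. The `threshold` value and the tiny stand-in dictionary are hypothetical.

```python
import re
from typing import Callable

def is_english(text: str, is_word: Callable[[str], bool], threshold: float = 0.5) -> bool:
    """Flag a review as English if enough of its tokens pass the dictionary check."""
    tokens = re.findall(r"[A-Za-z']+", text)
    if not tokens:
        return False
    hits = sum(1 for token in tokens if is_word(token))
    return hits / len(tokens) >= threshold

# Tiny stand-in dictionary so the example runs without extra dependencies.
TINY_DICTIONARY = {"this", "show", "is", "really", "good"}

def in_tiny_dictionary(word: str) -> bool:
    return word.lower() in TINY_DICTIONARY

english_flag = is_english("This show is really good", in_tiny_dictionary)
other_flag = is_english("Este anime es muy bueno", in_tiny_dictionary)
```

With `pyenchant` installed, `enchant.Dict("en_US").check` can be passed as `is_word` in place of the stand-in.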

***
### Use of the word 'opening' in reviews over time
An [anime opening](https://www.youtube.com/watch?v=2uq34TeWEdQ) can be a critical part of what people enjoy in a show.

Let's see how the use of "opening" changes over time in the Crunchyroll reviews.

<img src="images/opening_time.png">

***

### Top 5 Animes (# Reviews) sentiment of reviews over time

It is very expensive to use libraries like NLTK or spaCy to provide sentiment values for your tokens, so we opted to use a lexicon called [AFINN](http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010), which assigns each word an integer value in $[-5,5]$. Negative values indicate negative sentiment and positive values indicate positive sentiment. The closer a value is to $0$, the more neutral the word, with $0$ being totally neutral.
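A lexicon lookup of this kind can be sketched in a few lines. The `TOY_LEXICON` entries below are illustrative placeholders rather than the real AFINN word list, and averaging the matched values is an assumption about how a review-level score might be aggregated.

```python
import re

# Illustrative entries only; the real AFINN lexicon contains thousands of scored words.
TOY_LEXICON = {"love": 3, "great": 3, "boring": -3, "terrible": -3}

def sentiment_score(text: str, lexicon: dict) -> float:
    """Average the lexicon values of the words in `text` that appear in the lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[token] for token in tokens if token in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

positive = sentiment_score("I love this great show", TOY_LEXICON)
mixed = sentiment_score("Boring and terrible, but I love the opening", TOY_LEXICON)
```

Because scoring is just a dictionary lookup per token, it stays cheap even across tens of thousands of reviews.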

<img src="images/Top_5_Anime_SentTime.png">

It appears that there is a slight downward trend for all shows except Hunter X Hunter, and a quite noticeable one for Attack on Titan.

***

### Their actual ratings over time 

<img src="images/Top_5_Anime_RatingTime.png">

It appears that Black Clover was the only show whose ratings went up, while the rest follow trends similar to those implied by the sentiment-over-time graphs.

***

### Wordclouds for lower half of ratings vs higher half of ratings

Lastly, we analyzed the words that appeared more frequently in the lower half of ratings (less than 2.5 stars) compared to the higher half of ratings (higher than 2.5 stars).

To visualize the words that appear much more often in the lower half than in the higher half, and vice versa, we opted to use a word cloud. However, the normal wordcloud in Python strictly uses word counts, which is the incorrect metric here. To use our Per_Diff values as the weights, we need to convert them to integers and use a website that allows us to manually input the values.

We use https://www.wordclouds.com/
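The conversion to integer weights might look like the sketch below. It assumes Per_Diff is the difference in relative word frequency between the two rating groups, which is our reading of the metric and not a definition taken from the notebook; the `scale` factor and the token lists are hypothetical.

```python
from collections import Counter

def per_diff_weights(group_a, group_b, scale=1000):
    """Integer weights for words that are relatively more frequent in group_a than group_b."""
    counts_a, counts_b = Counter(group_a), Counter(group_b)
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    if not total_a or not total_b:
        return {}
    weights = {}
    for word, count in counts_a.items():
        diff = count / total_a - counts_b[word] / total_b
        if diff > 0:
            # Word cloud tools that take manual input expect whole-number weights.
            weights[word] = max(1, round(diff * scale))
    return weights

# Made-up tokens from low-rated and high-rated reviews.
low_tokens = ["boring", "boring", "plot", "bad"]
high_tokens = ["great", "plot", "great", "fun"]
low_weights = per_diff_weights(low_tokens, high_tokens)
```

Swapping the two arguments gives the weights for the higher half instead.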

<img src="images/lower_half_table.png">

<img src="images/lowerhalfcloud.png">

Same for words that appeared more frequently in the higher half compared to the lower half.

<img src="images/higher_half_table.png">

<img src="images/higherhalfcloud.png">

***

## 3.5 Markov Models

We wanted to try implementing Markov Models to simulate ...
- review summaries
- review text
- anime descriptions
- names

Using a unigram

<img src="images/uni_anime.png">

<img src="images/uni_review.png">

Using a bigram

<img src="images/bi_anime.png">

<img src="images/bi_review.png">
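For reference, a word-level Markov model of this kind can be sketched as follows. This is a minimal illustration and not the notebook's implementation: `order=1` corresponds to the unigram case and `order=2` to the bigram case, and the two-sentence corpus is made up.

```python
import random
from collections import defaultdict

def build_chain(texts, order=1):
    """Map each state (a tuple of `order` words) to the words observed right after it."""
    chain = defaultdict(list)
    for text in texts:
        words = text.split()
        for i in range(len(words) - order):
            state = tuple(words[i:i + order])
            chain[state].append(words[i + order])
    return chain

def generate(chain, order=1, length=10, seed=0):
    """Start from a random state and repeatedly sample an observed follower."""
    if not chain:
        return ""
    rng = random.Random(seed)
    output = list(rng.choice(sorted(chain)))
    while len(output) < length:
        followers = chain.get(tuple(output[-order:]))
        if not followers:  # dead end: this state was never followed by anything
            break
        output.append(rng.choice(followers))
    return " ".join(output)

corpus = ["a boy joins a guild of mages", "a girl joins a club of heroes"]
bigram_chain = build_chain(corpus, order=2)
sample = generate(bigram_chain, order=2, length=8, seed=1)
```

A larger `order` produces text that stays closer to the source sentences, while `order=1` mixes them more freely.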

## 4. Machine Learning

## Goal: Predict Crunchyroll Ratings for a Show 

__IMPORTANT NOTE:__

We ran into issues with cross-validation scores taking an immensely long time to compute on our whole dataset, so we decided to do this part of the project using a sample from our dataset. The project can still be run on the full dataset and remains sound.

# Text Features

Throughout, we use a __Normalizer__ as the scaler and __Euclidean__ as the metric in order to mimic cosine distance, since that is the best way of gauging distance between text data. Recall Lesson $10.2$.
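The reason this works: for vectors scaled to unit length, the squared Euclidean distance equals $2 - 2\cos\theta$, which is exactly twice the cosine distance, so both metrics rank neighbors identically. A small numeric check with made-up vectors:

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

a, b = [1.0, 2.0, 0.0], [2.0, 1.0, 1.0]
squared_euclidean = euclidean(l2_normalize(a), l2_normalize(b)) ** 2
twice_cosine = 2 * cosine_distance(a, b)
```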

### Stepwise

Our stepwise analysis of the text data

<img src="images/T_stepwise.png">

__Conclusion:__

We can see that Description, Tags, and Genre give the best performance, while Review and Similar do not perform as well.

### Feature Union

We want to include more than just one text column in our model. To do this, we need to apply a separate TF-IDF vectorizer to each column, which requires a Feature Union.
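Conceptually, a feature union fits one vectorizer per column and concatenates the resulting vectors side by side. The dependency-free sketch below illustrates that idea with a simplified TF-IDF (smoothed IDF, no length normalization); it is not the pipeline shown in the screenshot, and the rows and helper names are hypothetical.

```python
import math
from collections import Counter

def fit_tfidf(docs):
    """Learn a vocabulary with smoothed IDF weights from a list of strings."""
    doc_freq = Counter(word for doc in docs for word in set(doc.lower().split()))
    n_docs = len(docs)
    return {word: math.log((1 + n_docs) / (1 + df)) + 1
            for word, df in sorted(doc_freq.items())}

def transform_tfidf(doc, idf):
    """Term count times IDF for every vocabulary word, in vocabulary order."""
    counts = Counter(doc.lower().split())
    return [counts[word] * weight for word, weight in idf.items()]

def fit_union(rows, columns):
    """Fit one independent vocabulary per text column."""
    return {col: fit_tfidf([row[col] for row in rows]) for col in columns}

def transform_union(row, fitted):
    """Concatenate the per-column TF-IDF vectors into one feature vector."""
    vector = []
    for col, idf in fitted.items():
        vector.extend(transform_tfidf(row[col], idf))
    return vector

rows = [
    {"desc": "a hero saves the world", "genre": "action adventure"},
    {"desc": "a quiet school romance", "genre": "romance drama"},
]
fitted = fit_union(rows, ["desc", "genre"])
features = transform_union(rows[0], fitted)
```

Note that "romance" gets two separate features, one per column, which is the point of vectorizing each column on its own.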

<img src="images/Feature_union.png">

We get an MSE of ...

<img src="images/Feature_union_val.png">

Using all our text columns gave us an MSE of $0.10288$. That is better than the MSE of any single text column.

However, we saw that some text columns were lacking, specifically "similar" and "reviews". Let's try dropping them and seeing if our model's performance improves.

Now we try only using the top 3 performers: desc, tags, and genre.

<img src="images/Feature_union_reduc.png">

<img src="images/Feature_union_reduc_val.png">

It looks like dropping those two columns did help our MSE go down.

### Evaluating different $K$ values

<img src="images/T_train_val_table.png">

<img src="images/T_train_val.png">

It appears that $K = 5$ gives us the best MSE.
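The comparison behind a table like this can be sketched as a sweep over candidate $K$ values, keeping the one with the lowest validation MSE. The sketch uses a single held-out validation split rather than cross-validation, a plain Euclidean k-nearest neighbors regressor, and made-up toy data, so it only illustrates the procedure.

```python
import math

def knn_predict(train_X, train_y, query, k):
    """Average the targets of the k training points closest to `query`."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def validation_mse(train_X, train_y, val_X, val_y, k):
    """Mean squared error of the k-NN predictions on the validation points."""
    errors = [(knn_predict(train_X, train_y, x, k) - y) ** 2
              for x, y in zip(val_X, val_y)]
    return sum(errors) / len(errors)

# Hypothetical toy data: one feature, target roughly equal to the feature.
train_X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
train_y = [0.1, 0.9, 2.1, 2.9, 4.2, 5.0]
val_X = [[0.5], [2.5], [4.5]]
val_y = [0.5, 2.5, 4.5]

scores = {k: validation_mse(train_X, train_y, val_X, val_y, k) for k in (1, 2, 3, 6)}
best_k = min(scores, key=scores.get)
```

Very small $K$ tends to chase noise and very large $K$ averages everything toward the global mean, which is why the error curve usually has a minimum in between.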

## Min DF

We now tune the Minimum Document Frequency.

<img src="images/T_min_df1_table.png">
<img src="images/T_min_df1.png">

It appears that setting our min_df to 2 gives us the best results.

Let's try focusing in on the interval between $[0.1,0.2]$ to make sure we didn't miss out on a minimum.

<img src="images/T_min_df2_table.png">
<img src="images/T_min_df2.png">

It looks like it was good that we focused in on that interval $[0.1,0.2]$ since it now shows that a min_df of $0.17$ gives us the best MSE.
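As a reminder of what this parameter does, the sketch below prunes a vocabulary by document frequency. It follows the convention of scikit-learn's `TfidfVectorizer`, where an integer `min_df` is an absolute document count and a float is a proportion of documents; that our vectorizer is scikit-learn's is an assumption here, and the documents are made up.

```python
from collections import Counter

def filter_vocabulary(docs, min_df):
    """Keep only words whose document frequency reaches the min_df threshold."""
    doc_freq = Counter(word for doc in docs for word in set(doc.lower().split()))
    # An int is an absolute count; a float is a fraction of the number of documents.
    threshold = min_df if isinstance(min_df, int) else min_df * len(docs)
    return sorted(word for word, df in doc_freq.items() if df >= threshold)

docs = ["magic school", "magic war", "magic school club", "space war"]
by_count = filter_vocabulary(docs, 2)
by_fraction = filter_vocabulary(docs, 0.75)
```

Raising the threshold drops rare words, which shrinks the feature space and can reduce noise for a distance-based model.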

## Text Conclusion

The best model for the text data appears to be a k-nearest neighbors regressor with $5$ neighbors, using all textual data except Reviews and Similar, with a min_df of $0.17$ for our TF-IDF vectorizer, which gives us an MSE of $0.025864$.

# Quantitative Variables

We first look at the correlation matrix of all our quant variables and variable of interest

<img src="images/Q_corr.png">

Out of all the possible features, it appears that num_eps has the strongest positive correlation with aggregate rating, albeit a weak one. This makes sense, since a show must have good reviews to be able to keep making more episodes. Meanwhile, datetime has the strongest negative correlation.

### Stepwise

<img src="images/Q_stepwise.png">

It appears that "num_eps" performed the best while datetime did poorly.

### Picking a model

<img src="images/Q_model_table.png">

The model with just num_eps and duration performed the best.

### Evaluating different $K$ values

<img src="images/Q_train_val.png">

<img src="images/Q_train_val_graph.png">

It appears that $K = 25$ gives us the best MSE.

### Picking Distance Metrics & Scalers

<img src="images/Q_comb_table.png">

<img src="images/Q_comb_pick.png">

The model that used the `Chebyshev` metric and `MinMax` Scaler performed the best
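For intuition about this combination, here is a small sketch of both pieces: min-max scaling maps each column to $[0, 1]$, and the Chebyshev metric takes the largest per-feature gap. The `(num_eps, duration)` rows are hypothetical values, not rows from our dataset.

```python
def min_max_scale(rows):
    """Rescale each column to [0, 1] using that column's minimum and maximum."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]  # avoid dividing by zero
    return [[(value - low) / span for value, low, span in zip(row, lows, spans)]
            for row in rows]

def chebyshev(a, b):
    """Largest absolute difference across any single feature."""
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical (num_eps, duration) rows.
shows = [[12, 24.0], [24, 23.0], [500, 25.0]]
scaled = min_max_scale(shows)
raw_distance = chebyshev(shows[0], shows[1])
scaled_distance = chebyshev(scaled[0], scaled[1])
```

Before scaling, the distance between the first two shows is driven entirely by num_eps; after scaling, duration dominates instead, which shows how much the scaler choice affects which neighbors are selected.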

### Conclusion

It appears that, for our quantitative variables, a k-nearest neighbors regressor with 25 neighbors, using just num_eps and duration, the Chebyshev metric, and the MinMax scaler performs the best.

# Ensembling Text & Quantitative Features

We now ensemble our two models using a straight average of their predictions.
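A straight average simply takes the mean of the two models' predictions for each show. The sketch below uses made-up predictions and ratings to show the mechanics, including the fact that an average is not guaranteed to beat the stronger of the two models.

```python
def mse(predictions, targets):
    """Mean squared error between predictions and true values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def straight_average(*model_predictions):
    """Average the predictions of several models, one value per observation."""
    return [sum(values) / len(values) for values in zip(*model_predictions)]

# Hypothetical predictions and true ratings for three shows.
text_preds = [4.5, 3.9, 4.8]
quant_preds = [4.1, 4.3, 4.0]
actual = [4.6, 4.0, 4.7]

ensemble_preds = straight_average(text_preds, quant_preds)
text_mse = mse(text_preds, actual)
quant_mse = mse(quant_preds, actual)
ensemble_mse = mse(ensemble_preds, actual)
```

When one model is much weaker than the other, the average lands between the two errors, so the stronger model alone remains the better choice.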

<img src="images/Ensemble.png">

Our estimated MSE is ...

<img src="images/Ensemble_val.png">

We get an MSE of $0.119215$, which is not better than just using the Text Model.

## Conclusion

Using just the Text Model gives us the best estimated MSE, $0.025864$, for predicting a show's rating.