### What data did you choose? Specify the source, briefly describe the data

I chose data from MyAnimeList, a popular website for rating anime and tracking viewing progress. Users can add anime they have already watched or are currently watching to their collection, rate them, and write reviews. The dataset itself was taken from Kaggle (https://www.kaggle.com/datasets/hossamahmedsalah/anime-mal-dataset) and looks like this:

In [2]:
import pandas as pd
import numpy as np
from scipy import stats
pd.options.mode.chained_assignment = None

In [3]:
df = pd.read_csv('anime_filtered.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14474 entries, 0 to 14473
Data columns (total 31 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   anime_id        14474 non-null  int64  
 1   title           14474 non-null  object 
 2   title_english   5723 non-null   object 
 3   title_japanese  14440 non-null  object 
 4   title_synonyms  8936 non-null   object 
 5   image_url       14378 non-null  object 
 6   type            14474 non-null  object 
 7   source          14474 non-null  object 
 8   episodes        14474 non-null  int64  
 9   status          14474 non-null  object 
 10  airing          14474 non-null  bool   
 11  aired_string    14474 non-null  object 
 12  aired           14474 non-null  object 
 13  duration        14474 non-null  object 
 14  rating          13932 non-null  object 
 15  score           14474 non-null  float64
 16  scored_by       14474 non-null  int64  
 17  rank            12901 non-null 

The dataset contains 30 useful attributes: information about the title itself (genre, age rating, release dates, music, etc.) and site metrics derived from viewers (score, how many people added the title to their collection or favorites, how many people rated it, etc.).

### Describe what main task within the project you want to solve using this data (check the pattern / predict something / identify subgroups, etc.)

Most likely I will try to assess the success of an anime from its attributes, specifically its score on the site. It may also be possible to identify subgroups of anime by various success metrics and to find a group that receives the highest scores from the audience.

To begin with, let's find out which parameters have a statistically significant relationship with the score. Let's start with the more or less obvious ones, for example the age rating:

In [4]:
df_nozeros = df[df.score > 0]  # drop titles with a score of 0 (treated here as missing scores rather than real ratings)

In [108]:
df_nozeros.groupby('rating') \
       .agg({'rating':'size', 'score':'mean'}) \
       .rename(columns={'rating':'count','score':'mean_score'}) \
       .sort_values(by='mean_score').tail(10)

| rating | count | mean_score |
| --- | --- | --- |
| G - All Ages | 4519 | 5.695592 |
| Rx - Hentai | 1212 | 6.172483 |
| PG - Children | 1268 | 6.194487 |
| R+ - Mild Nudity | 876 | 6.401279 |
| PG-13 - Teens 13 or older | 4928 | 6.790899 |
| R - 17+ (violence & profanity) | 964 | 6.967085 |


The table shows the age ratings sorted by mean score in ascending order, together with the number of titles in each group. I hypothesize that anime rated 17+ score higher on average than anime rated 13+. Let's test this hypothesis:

In [109]:
score_17 = df_nozeros.score[(df_nozeros.rating=="R - 17+ (violence & profanity)")]
score_13 = df_nozeros.score[(df_nozeros.rating=="PG-13 - Teens 13 or older")]
stats.ttest_ind(score_17, score_13, nan_policy = "omit")

TtestResult(statistic=5.372697995743711, pvalue=8.054429510800736e-08, df=5890.0)

In [110]:
score_17.mean() - score_13.mean()

0.17618611743546886

Thus, the null hypothesis that the mean scores for these two age ratings are equal is rejected: the difference is statistically significant at the 0.01 level. Now let's look at the mean score by the original source of the anime:

In [111]:
df_nozeros.groupby('source') \
       .agg({'source':'size', 'score':'mean'}) \
       .rename(columns={'source':'count','score':'mean_score'}) \
       .sort_values(by='mean_score').tail(10)

| source | count | mean_score |
| --- | --- | --- |
| Unknown | 4197 | 5.996152 |
| Book | 93 | 6.290968 |
| Game | 572 | 6.409755 |
| Visual novel | 877 | 6.531733 |
| Card game | 54 | 6.612963 |
| Web manga | 136 | 6.627647 |
| 4-koma manga | 214 | 6.797757 |
| Novel | 337 | 6.842967 |
| Manga | 3025 | 6.975474 |
| Light novel | 531 | 7.210207 |


Let's compare, in the same way, the mean scores of anime whose original source is a light novel and anime whose original source is a manga:

In [112]:
score_lnovel = df_nozeros.score[(df_nozeros.source=="Light novel")]
score_manga = df_nozeros.score[(df_nozeros.source=="Manga")]
stats.ttest_ind(score_lnovel, score_manga, nan_policy = "omit")

TtestResult(statistic=6.081857011970487, pvalue=1.313844727363384e-09, df=3554.0)

In [113]:
score_lnovel.mean() - score_manga.mean()

0.23473277614356203

In this case, the null hypothesis of equal means is also rejected: the difference in mean score between the two original sources is statistically significant.
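Both comparisons call `stats.ttest_ind` with its default `equal_var=True`, which pools the variances of two groups of quite different sizes. A minimal sketch of a robustness check with Welch's unequal-variance test (`equal_var=False`) is given below. The helper name `welch_compare` and the synthetic arrays are assumptions made for illustration and are not part of the original analysis; in the notebook one would pass `score_lnovel` and `score_manga` (or `score_17` and `score_13`) instead.

```python
import numpy as np
from scipy import stats

def welch_compare(a, b):
    """Return (mean difference, t statistic, p-value) for Welch's t-test."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Drop NaNs explicitly so the mean difference and the test use the same values
    a = a[~np.isnan(a)]
    b = b[~np.isnan(b)]
    # equal_var=False switches from the pooled-variance test to Welch's test
    result = stats.ttest_ind(a, b, equal_var=False)
    return a.mean() - b.mean(), result.statistic, result.pvalue

# Synthetic placeholder data (NOT the MyAnimeList scores), only to make the sketch runnable
rng = np.random.default_rng(0)
group_a = rng.normal(loc=7.2, scale=0.8, size=500)
group_b = rng.normal(loc=7.0, scale=0.9, size=3000)

diff, t_stat, p_value = welch_compare(group_a, group_b)
print(f"difference={diff:.3f}, t={t_stat:.2f}, p={p_value:.3g}")
```

If the Welch p-value stays far below 0.01 on the real data, the conclusion does not depend on the equal-variance assumption.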

### Describe the main conclusions after testing the hypotheses

After testing the hypotheses about the equality of means, we can conclude that anime with a 17+ age rating score about 0.18 higher on average than anime rated 13+. In addition, the original source is also significantly associated with the score: anime based on manga are rated on average 0.23 lower than those based on light novels. These differences are noticeable (about 2% of the 10-point scale) and plausible, so when a regression model is built later, these regressors will most likely have statistically significant coefficients.
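The "about 2% of the scale" figure is the raw difference divided by the 10-point range. A complementary view is a standardized effect size, which relates the difference to the spread of the scores. The sketch below computes Cohen's d with a pooled standard deviation; the function name `cohens_d` is an assumption introduced for illustration, and the notebook's `score_lnovel` and `score_manga` series could be passed to it directly.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled sample standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a[~np.isnan(a)]
    b = b[~np.isnan(b)]
    n_a, n_b = len(a), len(b)
    # Pooled variance weights each group's sample variance by its degrees of freedom
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Share of the 10-point scale for the two differences reported above
for label, diff in [("17+ vs 13+", 0.176), ("Light novel vs Manga", 0.235)]:
    print(f"{label}: {diff / 10:.1%} of the scale")
```

A value of d around 0.2 is conventionally read as a small effect, so the standardized view helps to judge whether a statistically significant difference is also practically large.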

### Choose one of the considered research questions. Calculate the required sample size if you were planning to answer this question using an experiment. Explain how you estimated the necessary parameters (effect size, spread, etc.)

In my case, it would be strange to conduct an A/B test, since an anime is released gradually over a season and it takes even longer for users to start rating it. Nevertheless, let's calculate the sample size for an experiment comparing mean scores by original source:

In [114]:
((0.841621 + 2.326348)*np.sqrt(np.std(score_lnovel)**2 + np.std(score_manga)**2)/ 0.2)**2

289.80430852489474

The significance level is 0.01 and the power is 0.8. The corresponding Z values were taken from a table: 2.326348 is the 0.99 quantile of the standard normal distribution, which corresponds to a one-sided test, and 0.841621 is the 0.8 quantile. The standard deviations are estimated from the existing samples, and the effect size is set to 0.2, about the same as the one detected. In total, each of the two groups should contain about 290 titles in order to detect an effect of 0.2 at the given significance level and power.
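The calculation above follows the formula $n = (z_{1-\alpha} + z_{\text{power}})^2 (\sigma_1^2 + \sigma_2^2) / \delta^2$ per group. A minimal sketch that wraps it in a function and takes the quantiles from `scipy.stats.norm.ppf` instead of a table is shown below. The function name `sample_size_per_group`, its parameters, and the standard deviations in the usage example are assumptions for illustration; in the notebook one would pass `np.std(score_lnovel)` and `np.std(score_manga)`.

```python
import math
from scipy.stats import norm

def sample_size_per_group(sd_1, sd_2, effect, alpha=0.01, power=0.8, two_sided=False):
    """Per-group sample size for comparing two means with a z-test approximation."""
    # A one-sided test uses the 1 - alpha quantile, a two-sided test uses 1 - alpha / 2
    z_alpha = norm.ppf(1 - alpha / 2) if two_sided else norm.ppf(1 - alpha)
    z_beta = norm.ppf(power)
    n = (z_alpha + z_beta) ** 2 * (sd_1 ** 2 + sd_2 ** 2) / effect ** 2
    # Round up, since a fractional observation cannot be collected
    return math.ceil(n)

# Hypothetical standard deviations of 1.0 in both groups (placeholders, not dataset values)
n_one_sided = sample_size_per_group(1.0, 1.0, effect=0.2)
n_two_sided = sample_size_per_group(1.0, 1.0, effect=0.2, two_sided=True)
print(n_one_sided, n_two_sided)
```

Setting `two_sided=True` matches the two-sided t-tests used earlier and gives a somewhat larger required sample, because the critical value rises from about 2.33 to about 2.58.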