# Task 1. Set up an “Analysis Pipeline” (20%)
Each person in a group should do this Task in their own Jupyter notebook!

Often when Data Scientists do analyses with the same or similar datasets, they set up an “analysis pipeline”. A pipeline lets you:

* record the steps so you can remember what you did.

* repeat the steps reproducibly, without doing a bunch of manual and repetitive work.

* make changes to the series of processing steps so you can improve and iterate.

* troubleshoot and debug errors in your processing.

* let others reproduce your analysis.

* update your outputs (report, images, etc.) easily when your data changes, without redoing all your processing.

* spend more effort and energy on your analysis and visualizations (if you do a good job with the pipeline).



# Analysis Pipeline
1. Load Data

* Load data using `pandas.read_csv` with raw data at `data/raw/anime.csv`

* The file is comma-delimited, which is the `pandas.read_csv` default; other delimiters (space, tab) can be set with the `sep` argument

* Skip rows that have `Unknown` data

2. Clean Data

* Remove columns that are not being used, such as `['MAL_ID', 'Producers', 'Licensors', 'English name', 'Japanese name']`

* Deal with “incorrect” data, for example converting `Duration` from a string into minutes with the function `convert(episodeLength)`.

* Deal with missing data by using `dropna` to remove rows that contain `NaN` values.

3. Process Data

* Create any new columns needed that are combinations or aggregates of other columns (examples include weighted averages, categorizations, groups, etc…).

* Find and replace operations (examples include replacing the string ‘Strongly Agree’ with the number 5).

* Other substitutions as needed.

* Deal with outliers.

4. Wrangle Data

* Restructure data format (columns and rows).

* Merge other data sources into your dataset.

* Exploratory Data Analysis (not required for this Task).

* Data Analysis (not required for this Task).

* Export reports/data analyses and visualizations (not required for this Task).
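
The `Duration` conversion named in step 2 could look roughly like the sketch below. The body of `convert` is not shown in this notebook, so this is a hypothetical version that assumes the raw values look like `24 min. per ep.` or `1 hr. 55 min.`; the parameter name `episode_length` and the `hr`/`min`/`sec` abbreviations are assumptions.

```python
import math
import re


def convert(episode_length):
    """Return the duration in whole minutes, or NaN if no time is found.

    Hypothetical sketch: the real convert() in scripts may differ.
    """
    text = str(episode_length)
    total_seconds = 0
    found = False
    # Seconds per unit for the abbreviations assumed to appear in the data.
    for unit, factor in (("hr", 3600), ("min", 60), ("sec", 1)):
        match = re.search(rf"(\d+)\s*{unit}", text)
        if match:
            total_seconds += int(match.group(1)) * factor
            found = True
    if not found:
        return math.nan  # e.g. 'Unknown'
    return total_seconds // 60  # drop any leftover seconds


print(convert("24 min. per ep."))  # 24
print(convert("1 hr. 55 min."))    # 115
print(convert("Unknown"))          # nan
```

On a DataFrame this would typically be applied column-wise, for example `df["Duration"].map(convert)`.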



In [1]:
# Import dependencies
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import the module explicitly so project_functions is always defined
from scripts import project_functions

In [21]:
# Method Chaining
# Loading data
df = pd.read_csv('../../data/raw/anime.csv')

#load and process
pdf = project_functions.load_and_process(df)
pdf



Unnamed: 0,Name,Score,Genres,Type,Episodes,Aired,Premiered,Studios,Source,Duration,...,Score-10,Score-9,Score-8,Score-7,Score-6,Score-5,Score-4,Score-3,Score-2,Score-1
0,Cowboy Bebop,8.78,"Action, Adventure, Comedy, Drama, Sci-Fi, Space",TV,26,"Apr 3, 1998 to Apr 24, 1999",Spring 1998,Sunrise,Original,24,...,229170.0,182126.0,131625.0,62330.0,20688.0,8904.0,3184.0,1357.0,741.0,1580.0
2,Trigun,8.24,"Action, Sci-Fi, Adventure, Comedy, Drama, Shounen",TV,26,"Apr 1, 1998 to Sep 30, 1998",Spring 1998,Madhouse,Manga,24,...,50229.0,75651.0,86142.0,49432.0,15376.0,5838.0,1965.0,664.0,316.0,533.0
3,Witch Hunter Robin,7.27,"Action, Mystery, Police, Supernatural, Drama, ...",TV,26,"Jul 2, 2002 to Dec 24, 2002",Summer 2002,Sunrise,Original,25,...,2182.0,4806.0,10128.0,11618.0,5709.0,2920.0,1083.0,353.0,164.0,131.0
4,Bouken Ou Beet,6.98,"Adventure, Fantasy, Shounen, Supernatural",TV,52,"Sep 30, 2004 to Sep 29, 2005",Fall 2004,Toei Animation,Manga,23,...,312.0,529.0,1242.0,1713.0,1068.0,634.0,265.0,83.0,50.0,27.0
5,Eyeshield 21,7.95,"Action, Sports, Comedy, Shounen",TV,145,"Apr 6, 2005 to Mar 19, 2008",Spring 2005,Gallop,Manga,23,...,9226.0,14904.0,22811.0,16734.0,6206.0,2621.0,795.0,336.0,140.0,151.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17178,Uma Musume: Pretty Derby (TV) Season 2,7.21,"Slice of Life, Comedy, Sports",TV,13,"Jan 5, 2021 to ?",Winter 2021,Studio Kai,Game,23,...,209.0,201.0,448.0,621.0,297.0,112.0,50.0,16.0,16.0,18.0
17224,Wonder Egg Priority,8.32,"Psychological, Drama, Fantasy",TV,12,"Jan 13, 2021 to ?",Winter 2021,CloverWorks,Original,23,...,7564.0,12054.0,12657.0,5626.0,1516.0,682.0,258.0,113.0,77.0,144.0
17229,Gebäude Bäude,6.33,"Sci-Fi, Comedy",TV,10,"Nov 8, 2020 to Dec 10, 2020",Fall 2020,Jinnis Animation Studios,Original,3,...,6.0,11.0,19.0,35.0,39.0,20.0,6.0,5.0,1.0,4.0
17328,Jimihen!!: Jimiko wo Kaechau Jun Isei Kouyuu!!,6.12,"Romance, Ecchi",TV,8,"Jan 4, 2021 to Feb 22, 2021",Winter 2021,Studio Hokiboshi,Manga,3,...,241.0,100.0,194.0,343.0,426.0,282.0,143.0,91.0,70.0,80.0
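
The source of `project_functions.load_and_process` is not shown here, so the following is only a minimal sketch of how the load and clean steps could be written as a single method chain. The function name `load_and_process_sketch`, the toy CSV rows, and the `Type == "TV"` filter are assumptions (the filter is inferred from the output above, which contains only `TV` rows).

```python
import io

import numpy as np
import pandas as pd

# Columns the Clean Data step lists as unused.
UNUSED_COLUMNS = ["MAL_ID", "Producers", "Licensors", "English name", "Japanese name"]

# Tiny invented stand-in for data/raw/anime.csv (not real rows).
RAW_CSV = """MAL_ID,Name,Score,Type,Studios,Producers
1,Show A,8.5,TV,Studio X,Unknown
2,Film B,8.0,Movie,Studio Y,Producer P
3,Show C,7.5,TV,Studio Z,Producer Q
4,Show D,Unknown,TV,Unknown,Producer R
"""


def load_and_process_sketch(raw):
    """Hypothetical stand-in for project_functions.load_and_process."""
    return (
        raw
        .drop(columns=UNUSED_COLUMNS, errors="ignore")  # remove unused columns
        .replace("Unknown", np.nan)                     # mark 'Unknown' as missing
        .dropna()                                       # drop rows with missing data
        .loc[lambda d: d["Type"] == "TV"]               # assumed filter, see above
    )


processed = load_and_process_sketch(pd.read_csv(io.StringIO(RAW_CSV)))
print(processed)
```

Dropping the unused columns before `dropna` means an `Unknown` in a discarded column such as `Producers` does not remove the row. The original row labels are kept, which would also explain the gaps in the index of the output above.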


# Task 3 EDA
- Question 1 : Average `Duration` of anime episodes that aired on `TV`
- Question 2 : `Studios` with the highest average `Score` or `Ranked` position
- Question 3 : Most common `Genres` seen in anime
- Question 4 : `Source` with the highest average `Score`
- Question 5 : Average `Duration` of anime that aired as `Movie`
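
As a rough illustration of how these questions map onto pandas operations, here is a sketch on a small invented table. The rows are made up, and `Duration` and `Score` are assumed to be numeric already. Note that the processed table in this notebook holds a single `Type` value, so Question 5 would need the `Movie` rows to be kept.

```python
import pandas as pd

# Invented rows that reuse the column names of the processed table.
toy = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D"],
        "Type": ["TV", "TV", "TV", "Movie"],
        "Studios": ["Sunrise", "Sunrise", "Madhouse", "Madhouse"],
        "Source": ["Original", "Manga", "Manga", "Manga"],
        "Genres": ["Action, Comedy", "Action, Drama", "Action", "Drama"],
        "Duration": [24, 26, 23, 115],
        "Score": [8.0, 7.0, 8.5, 7.5],
    }
)

# Questions 1 and 5: mean duration per Type.
mean_duration = toy.groupby("Type")["Duration"].mean()

# Question 2: studio with the highest mean score.
best_studio = toy.groupby("Studios")["Score"].mean().idxmax()

# Question 3: split the comma-separated genres, then count each genre.
genre_counts = toy["Genres"].str.split(", ").explode().value_counts()

# Question 4: source with the highest mean score.
best_source = toy.groupby("Source")["Score"].mean().idxmax()

print(mean_duration)
print(best_studio, best_source)
print(genre_counts)
```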

In [19]:
# Let's explore the general information about the dataset and see what we have
pdf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3306 entries, 0 to 17469
Data columns (total 30 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Name           3306 non-null   object
 1   Score          3306 non-null   object
 2   Genres         3306 non-null   object
 3   Type           3306 non-null   object
 4   Episodes       3306 non-null   object
 5   Aired          3306 non-null   object
 6   Premiered      3306 non-null   object
 7   Studios        3306 non-null   object
 8   Source         3306 non-null   object
 9   Duration       3306 non-null   object
 10  Rating         3306 non-null   object
 11  Ranked         3306 non-null   object
 12  Popularity     3306 non-null   int64 
 13  Members        3306 non-null   int64 
 14  Favorites      3306 non-null   int64 
 15  Watching       3306 non-null   int64 
 16  Completed      3306 non-null   int64 
 17  On-Hold        3306 non-null   int64 
 18  Dropped        3306 non-nul

# pdf.nunique(axis = 0)
- `Name` has 3306 unique entries, one per row
- 489 unique values of `Studios`
- 5 kinds of `Rating`
- 3253 unique values of `Members`


In [20]:
pdf.nunique(axis = 0)

Name             3306
Score             408
Genres           2079
Type                1
Episodes          153
Aired            2542
Premiered         203
Studios           489
Source             15
Duration           37
Rating              5
Ranked           2863
Popularity       2801
Members          3253
Favorites        1340
Watching         2609
Completed        3150
On-Hold          2455
Dropped          2587
Plan to Watch    3147
Score-10         2221
Score-9          2331
Score-8          2674
Score-7          2825
Score-6          2652
Score-5          2379
Score-4          1849
Score-3          1352
Score-2          1003
Score-1           956
dtype: int64

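# From pdf.describe()
- The `count` row shows 3306 entries for every numeric column in the processed table. The figure of 17562 based on `MAL_ID` does not appear in this output, since `MAL_ID` is dropped during cleaning; it presumably describes the raw data.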


In [22]:
pdf.describe()

Unnamed: 0,Popularity,Members,Favorites,Watching,Completed,On-Hold,Dropped,Plan to Watch
count,3306.0,3306.0,3306.0,3306.0,3306.0,3306.0,3306.0,3306.0
mean,3386.106473,136159.5,2090.861162,9885.641258,85880.78,4314.740472,5505.50605,30572.842408
std,2960.277736,249143.1,8684.466169,26374.033754,184813.6,8286.632403,9230.437097,43747.139267
min,1.0,363.0,0.0,12.0,0.0,4.0,16.0,102.0
25%,1026.75,9833.75,22.0,525.0,3774.0,386.25,570.0,3238.5
50%,2525.5,40423.0,141.0,2430.0,18876.5,1551.0,2216.0,12547.0
75%,4940.75,144790.2,836.25,9132.25,77365.5,4880.75,6951.0,40112.0
max,12953.0,2589552.0,183914.0,566239.0,2182587.0,130961.0,174710.0,425531.0
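
`Score` is listed with dtype `object` in the `info()` output and is missing from the `describe()` table, because `describe()` summarizes only numeric columns by default. A minimal sketch of one way to bring it in, using `pd.to_numeric` on invented values (the real column may need extra cleaning first):

```python
import pandas as pd

# Invented values: Score stored as strings, as an object column would be.
toy = pd.DataFrame({"Score": ["8.78", "8.24", "7.27"], "Members": [10, 20, 30]})

before = list(toy.describe().columns)  # numeric columns only

# Convert to float; anything unparseable becomes NaN instead of raising.
toy["Score"] = pd.to_numeric(toy["Score"], errors="coerce")

after = list(toy.describe().columns)

print(before)
print(after)
```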
