**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Sandra Gomez
- Aydin Tabatabai
- Brandon Ng
- Sophia Yu
- Aarya Patel

# Research Question

What factors have the greatest impact on the popularity of Japanese animation (anime)? Specifically, how do the author, animation company, genre, episode count, and airing status (finished or still running) relate to popularity?

## Background and Prior Work

Anime became popular internationally in the 1980s, but it wasn't until the 1990s and early 2000s that it gained traction in the United States, when shows like Pokemon, Dragon Ball, and Sailor Moon aired on television.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1)  Now Netflix, Crunchyroll, and other streaming services make it easier for fans of the medium to come together. On this topic, we know that One Piece was the best-selling manga series from 2008 until 2018.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) We then wondered whether the One Piece anime was also one of the most-watched anime of the past decade. Did its 1,000-chapter manga and 1,094-episode adaptation (and counting!) draw audiences in? Or is its length an outlier among successful anime? This made us want to investigate which traits the most popular anime share today. Does an anime's age, its length, or whether it is still ongoing matter for its popularity ranking?

Themarysue.com investigates how anime became more accessible to foreign audiences through the internet.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) This could help our project determine whether popularity is affected by the year of release, since the internet may be a major factor in anime's ever-increasing sales. The article states that when the internet was first developing, people pirated content (although it was not easy), which eventually led Japanese production companies to offer their animations for simulcasting. Now international audiences can keep up with local audiences, which could strengthen the sense of community and increase the number of loyal fans, and in turn affect popularity.
  
Themedialab.me investigates how streaming services have changed the popularity of video media by examining seven key factors behind Netflix's success as a streaming service.<a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4) Netflix went on to beat out even television broadcasting for views. The article concludes that being flexible and personalized, offering variety, and using technology to its advantage made Netflix successful. Another website, cosmopolitan.com, tracked the most-watched Netflix series of all time as of February 2024.<a name="cite_ref-5"></a>[<sup>5</sup>](#cite_note-5) The shows it lists (Wednesday, Stranger Things, Dahmer – Monster, Bridgerton, etc.) share a few genres (supernatural, horror, historical) yet target very different demographics. This differs from anime, where demographic labels also act like genres, which lets us generalize about some of an anime's themes even though we have not seen every title in the data we collected. However, we believe there is still uniqueness within each anime, so if we find that one genre trends more than another, we can assume it trends because it contains many unique storylines, just as Wednesday and Stranger Things both have supernatural elements but are very different experiences. If anime streaming is more flexible and personal to the viewer’s interests, then popularity on streaming platforms should more accurately reflect what fans actually enjoy, not just what companies thought would do well with international audiences.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) D'souza, D. (05 Jun 2023) How did Anime go from Nerdy Cringey to mainstream Pop-Culture?. *LinkedIn*. https://www.linkedin.com/pulse/how-did-anime-go-from-nerdy-cringey-mainstream-deepa-d-souza/ 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Loo, E. (04 Dec 2009) 2009's Top-Selling Manga in Japan, by Series. *Anime News Network*. https://www.animenewsnetwork.com/news/2009-12-04/2009-top-selling-manga-in-japan-by-series 
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Polo, S. (04 Jun 2014) Technology and Anime, a Beautiful Love Story. *The Mary Sue*. https://www.themarysue.com/technology-anime-history/ 
4. <a name="cite_note-4"></a> [^](#cite_ref-4) 7 Key Factors Behind The Success Story Of Netflix. *The Media lab*. https://www.themedialab.me/7-key-factors-behind-success-story-netflix/ 
5. <a name="cite_note-5"></a> [^](#cite_ref-5) Venn, L. (08 Feb 2024) Netflix: The most watched TV series' of all time. *Cosmopolitan*. https://www.cosmopolitan.com/uk/entertainment/g42176068/netflix-most-watched/ 


# Hypothesis


We believe that publisher and demographic/genre will be strong factors correlating with high popularity in anime. For example, more established publishing companies could be more likely to have a consistent fanbase from previous works, certain demographics might be more targeted (thus leading to higher views), and certain genres could tend to attract more viewers than others, correlating to higher popularity.

# Data

## Data overview

- Dataset #1
  - Dataset Name: Anilist API
  - Link to the dataset: https://github.com/AniList/ApiV2-GraphQL-Docs?tab=readme-ov-file
  - Number of observations: 1500
  - Number of variables: 69

Anilist API is an API that provides quick and powerful access to over 500k anime and manga entries. Key categories in the schema include media, mediatrend, character, staff, and live airing data, each containing important variables such as anime name, release date, ranking, and popularity. The data types range from categorical variables, such as anime name, genres, and type, to numerical metrics like ranking, number of episodes, and popularity. Genres, average score, popularity, and ranking may serve as proxies for audience preferences and content popularity. Wrangling and preprocessing involve querying the API to access the data, standardizing data formats, keeping naming conventions consistent, and handling missing values, especially for anime that have only a Japanese title and no English one.

## Anilist API Dataset

In [2]:
import requests
import json
import pandas as pd
import numpy as np



### Fetching Data from API and Downloading

In [3]:
url = 'https://graphql.anilist.co'

query = '''
query ($id: Int, $page: Int, $perPage: Int, $search: String) {
    Page (page: $page, perPage: $perPage) {
        pageInfo {
            total
            currentPage
            lastPage
            hasNextPage
            perPage
        }
        media (id: $id, search: $search, type: ANIME) {
            title {
                romaji
                english
            }
            startDate {
                year
                month
            }
            endDate {
                year
                month
            }
            season
            seasonYear
            tags {
                name
            }
            format
            status
            episodes
            genres
            popularity
            averageScore
            countryOfOrigin
            source
            studios {
                edges {
                    node {
                        name
                    }
                }
            }
            rankings {
                rank
            }

        }
    }
}
'''
variables = {
    'page': 1,
    'perPage': 50
}

all_data = []

while True:
    response = requests.post(url, json={'query': query, 'variables': variables})
    if response.status_code == 200:
        data = response.json()
        all_data.extend(data['data']['Page']['media'])
        pageInfo = data['data']['Page']['pageInfo']
        if not pageInfo['hasNextPage']:
            break
        variables['page'] += 1  # Increment the page number for the next request
    else:
        print(f"Failed to fetch data: {response.status_code}")
        print(f"Error message: {response.text}")

        break


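
The loop above keeps requesting pages until the API reports no next page, which can run for a long time or hit the request limit. A minimal sketch of a bounded variant follows. The names `fetch_pages`, `max_pages`, `post`, and `sleep` are hypothetical, and the handling of HTTP 429 assumes the server sends a `Retry-After` header (it falls back to 60 seconds otherwise). The default of 30 pages of 50 would yield the 1,500 rows reported in the data overview, but that the original run stopped there is an assumption.

```python
import time

API_URL = 'https://graphql.anilist.co'

def fetch_pages(query, per_page=50, max_pages=30, post=None, sleep=time.sleep, pause=1.0):
    """Collect `media` records page by page, stopping after `max_pages` pages."""
    if post is None:
        import requests  # imported lazily so the helper can be exercised offline
        post = requests.post
    records = []
    page = 1
    while page <= max_pages:
        payload = {'query': query, 'variables': {'page': page, 'perPage': per_page}}
        response = post(API_URL, json=payload)
        if response.status_code == 429:
            # Rate limited: wait (assumed Retry-After header), then retry the same page
            sleep(float(response.headers.get('Retry-After', 60)))
            continue
        response.raise_for_status()
        page_data = response.json()['data']['Page']
        records.extend(page_data['media'])
        if not page_data['pageInfo']['hasNextPage']:
            break
        page += 1
        sleep(pause)  # small pause between pages to stay under the request limit
    return records
```

Calling `all_data = fetch_pages(query)` with the `query` string defined above would then replace the `while True` loop.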

### Creating Dataframe from the API data

In [None]:
api_df = pd.DataFrame(all_data)
api_df.head()

### Data Cleaning
Several columns (title, dates, studios, rankings, and tags) come back as nested dictionaries or lists, so we need to pull the values out of them.
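
As an alternative to writing one helper per column, `pd.json_normalize` can flatten nested dictionaries into dotted column names in one step. The sketch below uses a made-up record shaped like one `media` entry from the query; it is an illustration, not the approach used in the cells that follow. List-valued fields such as `genres` are left as lists.

```python
import pandas as pd

# A made-up record shaped like one `media` entry returned by the query
sample = [{
    'title': {'romaji': 'Cowboy Bebop', 'english': 'Cowboy Bebop'},
    'startDate': {'year': 1998, 'month': 4},
    'episodes': 26,
    'genres': ['Action', 'Sci-Fi'],
}]

# Nested dictionaries become columns such as 'title.english' and 'startDate.year'
flat = pd.json_normalize(sample)
print(sorted(flat.columns))
```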

In [None]:
# Clean title column from dictionary to string
def getTitle(t):
    if t['english'] is not None:
        return t['english']
    else:
        return t['romaji']
    
api_df['title'] = api_df['title'].apply(getTitle)
api_df.head()

In [None]:
# Clean date columns from dictionary to 'month-year' (or just 'year' when the month is missing)
def getDate(d):
    # No usable date if the value is not a dictionary or has no year
    if not isinstance(d, dict) or d.get('year') is None:
        return None
    if d.get('month') is None:
        return f"{d['year']}"
    return f"{d['month']}-{d['year']}"
api_df['startDate'] = api_df['startDate'].apply(getDate)
api_df['endDate'] = api_df['endDate'].apply(getDate)
api_df.head()


In [None]:
# Clean studio column from dictionary to string
def getStudios(s):
    edges = s.get('edges', [])
    node = edges[0]['node']['name'] if edges and edges[0].get('node') else None
    return node
api_df['studios'] = api_df['studios'].apply(getStudios)
api_df.head()

In [None]:
# Clean rankings: keep the rank from the first ranking entry.
# The query only requests `rank`, so the first entry is not guaranteed to be the global ranking.
def getRanking(r):
    if len(r) > 0:
        return r[0]['rank']
    else:
        return None

api_df['rankings'] = api_df['rankings'].apply(getRanking)
api_df.head()
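
Because each anime can carry several ranking entries, taking the first one may mix different kinds of rank across rows. The sketch below shows one way to pick a specific entry. It assumes the query is extended to also request `type` and `allTime` inside `rankings`; those field names, the `'POPULAR'` value, and the helper `pick_all_time_rank` are assumptions to check against the API schema, and the sample data is made up.

```python
def pick_all_time_rank(rankings, rank_type='POPULAR'):
    """Return the all-time rank of the given type, or None if there is none."""
    for entry in rankings or []:
        if entry.get('allTime') and entry.get('type') == rank_type:
            return entry.get('rank')
    return None

# Made-up ranking entries for a single anime
sample_rankings = [
    {'rank': 46, 'type': 'RATED', 'allTime': True},
    {'rank': 3, 'type': 'POPULAR', 'allTime': False},
    {'rank': 120, 'type': 'POPULAR', 'allTime': True},
]
print(pick_all_time_rank(sample_rankings))
```

Returning `None` for a missing rank keeps "no ranking" distinct from a real rank value.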

In [None]:
# Clean tags column, converting a list of dictionaries to a list of tag names
def convertTags(d):
    return [dic['name'] for dic in d if isinstance(dic, dict) and 'name' in dic]

api_df['tags'] = api_df['tags'].apply(convertTags)
api_df.head()

In [None]:
# Fill and remove NaN for certain columns

# If the title of an anime doesn't exist, remove the row
# (.copy() avoids a SettingWithCopyWarning on the assignments below)
api_df = api_df[api_df['title'].notna()].copy()

# If the rankings, popularity, or averageScore is NaN then convert to 0
for col in ['rankings', 'popularity', 'averageScore']:
    api_df[col] = api_df[col].fillna(0)

api_df.head()


### Catching Edge Cases

In [None]:
# Set endDate to 'Ongoing' for anime that are still releasing
api_df.loc[api_df['status'] == 'RELEASING', 'endDate'] = 'Ongoing'
api_df

In [None]:
# Fill missing season and seasonYear values from startDate
month_season_dict = {1: 'WINTER', 2: 'WINTER', 3: 'SPRING',
                     4: 'SPRING', 5: 'SPRING', 6: 'SUMMER',
                     7: 'SUMMER', 8: 'SUMMER', 9: 'FALL',
                     10: 'FALL', 11: 'FALL', 12: 'WINTER'}
for index, row in api_df.iterrows():
    if pd.isnull(row['season']) or pd.isnull(row['seasonYear']):
        start_date = row['startDate']
        # Skip rows with no start date to fall back on
        if not isinstance(start_date, str):
            continue
        if '-' in start_date:
            month, year = start_date.split('-')
            api_df.at[index, 'season'] = month_season_dict[int(month)]
            api_df.at[index, 'seasonYear'] = int(year)
        else:
            # Only the year is known, so season stays missing
            api_df.at[index, 'seasonYear'] = int(start_date)
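
The same fallback can be written without a row loop. The sketch below runs on a small made-up frame (`demo`) and reuses the month-to-season mapping from the cell above. Unlike the loop, it only fills values that are missing and never overwrites an existing season.

```python
import pandas as pd

month_to_season = {1: 'WINTER', 2: 'WINTER', 3: 'SPRING',
                   4: 'SPRING', 5: 'SPRING', 6: 'SUMMER',
                   7: 'SUMMER', 8: 'SUMMER', 9: 'FALL',
                   10: 'FALL', 11: 'FALL', 12: 'WINTER'}

# Made-up rows: full date, year-only date, and a row that needs no filling
demo = pd.DataFrame({
    'startDate': ['4-1998', '1987', None],
    'season': [None, None, 'FALL'],
    'seasonYear': [float('nan'), float('nan'), 2004.0],
})

# Split 'month-year' (month optional) into two columns; unmatched rows become NaN
parts = demo['startDate'].str.extract(r'^(?:(?P<month>\d{1,2})-)?(?P<year>\d{4})$')
month = pd.to_numeric(parts['month'], errors='coerce')
year = pd.to_numeric(parts['year'], errors='coerce')

# Fill only the missing values, leaving existing ones untouched
demo['season'] = demo['season'].fillna(month.map(lambda m: month_to_season.get(m)))
demo['seasonYear'] = demo['seasonYear'].fillna(year)
print(demo)
```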

In [None]:
# Download csv file
# (so we don't have to query the API every time and hit the request limit)
api_df.to_csv('anime-dataset.csv', index=False)

In [4]:
df = pd.read_csv('anime-dataset.csv')
df

| Unnamed: 0 | title | startDate | endDate | season | seasonYear | tags | format | status | episodes | genres | popularity | averageScore | countryOfOrigin | source | studios | rankings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cowboy Bebop | 4-1998 | 4-1999 | SPRING | 1998.0 | ['Space', 'Crime', 'Episodic', 'Ensemble Cast'... | TV | FINISHED | 26.0 | ['Action', 'Adventure', 'Drama', 'Sci-Fi'] | 339045 | 86.0 | JP | ORIGINAL | Sunrise | 46.0 |
| 1 | Cowboy Bebop: The Movie - Knockin' on Heaven's... | 9-2001 | 9-2001 | SUMMER | 2001.0 | ['Terrorism', 'Primarily Adult Cast', 'Crime',... | MOVIE | FINISHED | 1.0 | ['Action', 'Drama', 'Mystery', 'Sci-Fi'] | 63194 | 82.0 | JP | ORIGINAL | bones | 44.0 |
| 2 | Trigun | 4-1998 | 9-1998 | SPRING | 1998.0 | ['Guns', 'Fugitive', 'Male Protagonist', 'Prim... | TV | FINISHED | 26.0 | ['Action', 'Adventure', 'Comedy', 'Drama', 'Sc... | 120520 | 79.0 | JP | MANGA | MADHOUSE | 279.0 |
| 3 | Witch Hunter ROBIN | 7-2002 | 12-2002 | SUMMER | 2002.0 | ['Female Protagonist', 'Police', 'Magic', 'Urb... | TV | FINISHED | 26.0 | ['Action', 'Drama', 'Mystery', 'Supernatural'] | 16120 | 67.0 | JP | ORIGINAL | Sunrise | 24.0 |
| 4 | Beet the Vandel Buster | 9-2004 | 9-2005 | FALL | 2004.0 | ['Shounen'] | TV | FINISHED | 52.0 | ['Adventure', 'Fantasy', 'Supernatural'] | 2238 | 62.0 | JP | MANGA | Toei Animation | 61.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1495 | Kurogane Communication | 10-1998 | 3-1999 | FALL | 1998.0 | ['Robots', 'Post-Apocalyptic', 'Artificial Int... | TV | FINISHED | 24.0 | ['Action', 'Adventure', 'Drama', 'Sci-Fi'] | 1496 | 58.0 | JP | MANGA | APPP | 30.0 |
| 1496 | Cutie Honey | 10-1973 | 3-1974 | FALL | 1973.0 | ['Female Protagonist', 'Episodic', 'Shounen', ... | TV | FINISHED | 25.0 | ['Action', 'Adventure', 'Ecchi', 'Mahou Shoujo... | 5296 | 58.0 | JP | MANGA | Toei Animation | 0.0 |
| 1497 | Space Fantasia 2001 Nights | 6-1987 | 6-1987 | SUMMER | 1987.0 | ['Space', 'Seinen', 'Time Skip'] | OVA | FINISHED | 1.0 | ['Sci-Fi'] | 1111 | 54.0 | JP | MANGA | TMS Entertainment | 0.0 |
| 1498 | Haha wo Tazunete Sanzenri | 1-1976 | 12-1976 | WINTER | 1976.0 | ['Classic Literature', 'Historical', 'Male Pro... | TV | FINISHED | 52.0 | ['Adventure', 'Drama'] | 3560 | 69.0 | JP | OTHER | Nippon Animation | 0.0 |
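
One consequence of the CSV round trip is that list-valued columns such as `tags` and `genres` come back as plain strings (for example `"['Action', 'Sci-Fi']"`) rather than Python lists. A minimal sketch of restoring them with `ast.literal_eval` follows. It uses a small in-memory frame instead of `anime-dataset.csv`, and the variable names are illustrative.

```python
import ast
import io
import pandas as pd

# Write a tiny made-up frame with a list-valued column to an in-memory CSV
original = pd.DataFrame({'title': ['Cowboy Bebop'], 'genres': [['Action', 'Sci-Fi']]})
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# A plain read returns the list as its string representation
buffer.seek(0)
as_text = pd.read_csv(buffer)
print(type(as_text.loc[0, 'genres']))

# A converter parses the string back into a list while reading
buffer.seek(0)
restored = pd.read_csv(buffer, converters={'genres': ast.literal_eval})
print(restored.loc[0, 'genres'])
```

For the real file, the same idea would be `pd.read_csv('anime-dataset.csv', converters={'tags': ast.literal_eval, 'genres': ast.literal_eval})`, assuming every value in those columns is a valid list literal.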


# Ethics & Privacy

Regarding ethics and privacy, we will make sure our approach is lawful and fair, and we will address any ethical or privacy concerns that arise over the course of the project. Our data can carry inherent biases: selecting specific variables can lead us to overlook some factors while overemphasizing others. For instance, by focusing mainly on online popularity metrics and sales figures, we might undervalue genres that are less prevalent online but have significant cultural importance and dedicated viewership in specific regions. There could also be geographical bias, as the data might come predominantly from a US-based database, overlooking international sources that capture the popularity of anime and manga in non-English-speaking countries.

To detect biases, we will generalize the data and create subgroups for more detailed exploration. We can actively assess the effects of cultural and regional bias, particularly when comparing the popularity of manga in different countries. We can also consider building a more diverse dataset that combines international and local sources, if possible, or otherwise acknowledge the limitation and narrow the scope of our generalizations. In addition, we can incorporate new variables that influence popularity and adjust the weight of each variable to improve the accuracy and fairness of our analysis. Concerning data privacy and equitable impact in our topic area, we recognize the risk that our research could influence the popularity of specific manga, inadvertently promoting certain titles or discouraging others. However, given that all the data used is publicly available, data privacy concerns are minimal.

In the context of privacy, the data for our project comes from a publicly available API, [Anilist](https://anilist.gitbook.io/anilist-apiv2-docs/). Anilist has 5 terms of use: non-commercial use, prohibition of data storage service utilization, prevention of mass data collection, adherence to naming guidelines, and compliance with restrictions on competing services. Our usage of the data adheres to all of them. Furthermore, our project doesn't involve human subjects, so it does not risk leaking personal information about individuals.


# Team Expectations 

* *Maintain open and regular communication to keep all team members informed and engaged.*
* *Take responsibility for your contributions, acknowledging both successes and areas for improvement.*
* *Work collaboratively, offering and seeking support to leverage the team's collective strengths.*
* *Acknowledge the importance of personal time, promoting flexibility to enhance well-being and productivity.*
* *Prioritize transparency, especially during busy periods, to manage expectations and maintain trust.*
* *Cultivate a collaborative and supportive team environment, prioritizing reliability, accountability, respect, openness to feedback, and a willingness to learn from mistakes.*

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/6  |  1 PM |  Do our own research on potential topic | Find and discuss topics; Choose a topic; Start looking at related datasets; Assign group members to complete each section of the project proposal | 
| 2/20  | 1 PM  | Find datasets corresponding to the different time periods we want to analyze  | Discuss which datasets we want to use and how we will compile our data; Select datasets that we want to use and assign each person to think about how we will clean the data   |
| 2/24  | 4 PM  | Think about what datasets are compatible | Review/Edit wrangling/EDA (and review for submission); Discuss Analysis Plan  |
| 2/25  | Before 11:59 PM  | Review project checkpoint | Turn in Checkpoint #1: Data |
| 2/27  | 1 PM  | Think about analysis details and what we need to complete | Discuss analysis; Assign subtasks for data analysis; Complete project check-in |
| 3/5   | 1 PM   | Review analysis and draft conclusion points | Discuss/edit full project; Assign remaining necessary tasks |
| 3/10  | Before 11:59 PM  | Review project EDA | Turn in Checkpoint #2: EDA |
| 3/12  | 1 PM   | Continue to work on final tasks | Review project; Discuss points of improvement |
| 3/20  | Before 11:59 PM  | Finalize project | Turn in Final Project & Group Project Surveys |