# _Scraping MyAnimeList for Top Anime Series using Python_




![anime_intro](https://i.imgur.com/O5qf0Ms.jpg)





Anime originated in Japan and has captured a phenomenal audience in recent decades. The theatrical releases of anime over the past few months show the extent of its reach. It is an art form that conveys emotion in remarkable ways and captivates audiences with its storytelling. Anime music is melodic and catchy, and the escape anime provides into a world of imagination is truly remarkable. Modern world culture has taken anime to its heart, and this form of entertainment has significantly impacted personal relationships. In this project I scrape the website MyAnimeList to produce a list of the top anime around the world, ranked by rating, and save it to a CSV file along with a link to each series.

MyAnimeList is a large anime database and community. It provides an updated list of the world's most popular anime series, [Top Anime Series](https://myanimelist.net/topanime.php), in different categories such as All Anime, Top Airing, Top Upcoming, Top TV Series, Top Movies, Top OVAs, Top ONAs, Top Specials, Most Popular and Most Favorited, along with the rank, name and rating of each series.



![webpage_img](https://i.imgur.com/l8fysqD.png)


Web Scraping is the process through which we extract data from a website, and save it in a form which is easy to read, to understand and to work on.

In this project we retrieve the information on the webpage using Python libraries such as [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/) and [Requests](https://requests.readthedocs.io/en/latest/).

Here is an outline of the steps to be followed:
1) Download the webpage using Requests.
2) Extract information from the HTML source code of a webpage programmatically, using the BeautifulSoup library.
3) Create a BeautifulSoup object to parse the content within the source code.
4) Compile the extracted information into Python lists and dictionaries.
5) Create a dataframe of the scraped webpage using [pandas](https://pandas.pydata.org/docs/).
6) Convert the pandas dataframe to a CSV file.
7) Future work and references.

The CSV file produced at the end of the project contains the columns `Rank`, `Anime_name`, `Anime_link` and `Ratings`, with rows such as:
`9,Gintama',https://myanimelist.net/anime/9969/Gintama,9.04`


## _Downloading the webpage using requests_



To understand web scraping, it’s important to first understand that web pages are built with text-based mark-up languages – the most common being HTML.

A mark-up language defines the structure of a website’s content. Because mark-up languages share universal components and tags, it is much easier for web scrapers to pull the information they need. Once the HTML is parsed, the scraper extracts the necessary data and stores it.

Note: Not all websites allow web scraping, especially when personal information of users is involved, so we should always ensure that we do not explore too much and do not collect information that might belong to someone else. Websites generally have protections in place, and they may block our access if they see us scraping a large amount of data.
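
One common way to check what a site permits is its `robots.txt` file. The sketch below uses Python's standard `urllib.robotparser` module on a made-up `robots.txt`. The rules and the `example.com` URLs are assumptions for illustration only, not MyAnimeList's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration;
# a real site's rules must be read from its own /robots.txt file.
sample_robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(sample_robots_txt.splitlines())

# can_fetch() reports whether the given user agent may request the URL
allowed_public = parser.can_fetch("*", "https://example.com/topanime.php")
allowed_private = parser.can_fetch("*", "https://example.com/private/data")

print(allowed_public, allowed_private)
```

For a live site, the parser's `set_url` and `read` methods can load the real file before calling `can_fetch`.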


When you access a URL like https://myanimelist.net/topanime.php using a web browser, it downloads the contents of the web page the URL points to and displays the output on the screen. Before we can extract information from a web page, we need to download the page using Python.

We'll use a library called requests to download web pages from the internet. Let's begin by installing and importing the library.



In [1]:
!pip install requests --upgrade --quiet

In [2]:
import requests

We download the webpage using the `requests.get` function and then read the page's source code as text.

In [3]:
response=requests.get('https://myanimelist.net/topanime.php')

The contents of the web page can be accessed using the .text property of the response.

In [4]:
page_content=response.text

In [5]:
page_content[0:1000]

'\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n<html lang="en">\n<head>\n    \n<link rel="preconnect" href="//fonts.gstatic.com/" crossorigin="anonymous" />\n<link rel="preconnect" href="//fonts.googleapis.com/" crossorigin="anonymous" />\n<link rel="preconnect" href="//tags-cdn.deployads.com/" crossorigin="anonymous" />\n<link rel="preconnect" href="//www.googletagservices.com/" crossorigin="anonymous" />\n<link rel="preconnect" href="//www.googletagmanager.com/" crossorigin="anonymous"/>\n<link rel="preconnect" href="//apis.google.com/" crossorigin="anonymous"/>\n<link rel="preconnect" href="//pixel-sync.sitescout.com/" crossorigin="anonymous"/>\n<link rel="preconnect" href="//pixel.tapad.com/" crossorigin="anonymous"/>\n<link rel="preconnect" href="//c.deployads.com/" crossorigin="anonymous"/>\n<link rel="preconnect" href="//tpc.googlesyndication.com/" crossorigin="anonymous"/>\n<link rel="preconne

In [6]:
print(response)

<Response [200]>


requests.get returns a response object with the page contents and some information indicating whether the request was successful, using a status code. Learn more about HTTP status codes here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.

If the request was successful, response.status_code is set to a value between 200 and 299.

In [7]:
response.ok

True
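
As a rough illustration of how status codes are grouped, the helper below labels a code by its range using only the standard library. The name `describe_status` is hypothetical and not part of Requests. Note that `response.ok` in Requests is `True` for any status code below 400, so it also covers redirects, not only the 200 to 299 range.

```python
from http import HTTPStatus

def describe_status(status_code):
    """Return a short label for an HTTP status code (hypothetical helper)."""
    if 200 <= status_code <= 299:
        category = "success"
    elif 300 <= status_code <= 399:
        category = "redirection"
    elif 400 <= status_code <= 499:
        category = "client error"
    elif 500 <= status_code <= 599:
        category = "server error"
    else:
        category = "informational or unknown"
    # HTTPStatus maps standard codes to their reason phrases;
    # non-standard codes raise ValueError
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        phrase = "Unknown"
    return f"{status_code} {phrase} ({category})"

print(describe_status(200))
print(describe_status(404))
```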

The page content printed earlier is the source code of the web page. It is written in a language called [HTML](https://developer.mozilla.org/en-US/docs/Web/HTML), which defines the content and structure of the web page.

Let's save the contents to a file with the .html extension.

In [8]:
with open('topanime.html','w') as file:
    file.write(page_content)

You can now view the file using the "File > Open" menu option within Jupyter and clicking on topanime.html in the list of files displayed. Here's what you'll see when you open the file:

![html_page](https://i.imgur.com/Np20H2n.png)










Clicking the links inside the saved HTML page may result in errors. We have now successfully downloaded the web page using requests.

## _Parse the HTML source code using BeautifulSoup_

To extract information from the HTML source code of a webpage programmatically, we can use the Beautiful Soup library. Beautiful Soup is a Python package for parsing HTML and XML documents, and it lets us get data out of a raw sequence of characters. It creates a parse tree for parsed pages that can be used to extract data from HTML, which makes it a handy tool for web scraping. Let's install the library and import the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) class from the bs4 module. We then create a BeautifulSoup object, which has several properties and methods for extracting information from the HTML document.

![HTML_IMAGE](https://i.imgur.com/9mIMtZP.png)




In [9]:
#Install the beautifulsoup4 library
!pip install beautifulsoup4 --upgrade --quiet

In [10]:
#import the BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup

Next, let's read the contents of the file  topanime.html and create a BeautifulSoup object to parse the content.

In [11]:
#reading the contents of the file
with open('topanime.html','r') as f:
    topanime_htmlSource=f.read()

Displaying the first 1000 characters of the source code

In [12]:
topanime_htmlSource[:1000]

'\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n<html lang="en">\n<head>\n    \n<link rel="preconnect" href="//fonts.gstatic.com/" crossorigin="anonymous" />\n<link rel="preconnect" href="//fonts.googleapis.com/" crossorigin="anonymous" />\n<link rel="preconnect" href="//tags-cdn.deployads.com/" crossorigin="anonymous" />\n<link rel="preconnect" href="//www.googletagservices.com/" crossorigin="anonymous" />\n<link rel="preconnect" href="//www.googletagmanager.com/" crossorigin="anonymous"/>\n<link rel="preconnect" href="//apis.google.com/" crossorigin="anonymous"/>\n<link rel="preconnect" href="//pixel-sync.sitescout.com/" crossorigin="anonymous"/>\n<link rel="preconnect" href="//pixel.tapad.com/" crossorigin="anonymous"/>\n<link rel="preconnect" href="//c.deployads.com/" crossorigin="anonymous"/>\n<link rel="preconnect" href="//tpc.googlesyndication.com/" crossorigin="anonymous"/>\n<link rel="preconne

In [13]:
# creating an object of the BeautifulSoup Library
anime_doc_demo=BeautifulSoup(topanime_htmlSource,'html.parser')

In [14]:
type(anime_doc_demo)

bs4.BeautifulSoup

Using the anime_doc_demo object we can get information from the page.
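
To make this concrete, here is a minimal sketch on a tiny hand-written HTML document rather than the real page. The markup, the `example.com` links and the `sample_` variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML document, used here instead of the real page
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h3 class="anime_ranking_h3"><a href="https://example.com/anime/1">First Anime</a></h3>
    <h3 class="anime_ranking_h3"><a href="https://example.com/anime/2">Second Anime</a></h3>
  </body>
</html>
"""

sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Tags can be reached as attributes of the document object
page_title = sample_doc.title.text

# find() returns the first matching tag, find_all() returns every match
first_heading = sample_doc.find('h3')
all_headings = sample_doc.find_all('h3')

# Attributes of a tag are read like dictionary keys
first_link = first_heading.a['href']

print(page_title)
print(first_link)
print(len(all_headings))
```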

In [15]:
def get_anime_doc(url):
    response=requests.get(url)
    # stop early if the page could not be downloaded
    if response.status_code != 200:
        print("Status code :",response.status_code)
        raise Exception("Failed to fetch the webpage",url)
    anime_doc=BeautifulSoup(response.text,'html.parser')
    return anime_doc

`get_anime_doc` is a function that returns a BeautifulSoup object for the URL passed as its argument. The returned `anime_doc` has properties that allow us to read through the tags and attributes of the source code to get the required information.

## _Extract Rank, Anime_Name, Anime_Links and Ratings from page_

### _Extracting rank tags and returning a list_ 

We have to inspect the elements of the webpage to get to the required tags and attributes.

In [16]:
#Extracting the td tags that hold the ranks from anime_doc_demo into a list rank_tags_demo
rank_tags_demo=anime_doc_demo.find_all('td',class_='rank ac')
rank_tags_demo[:5]

[<td class="rank ac" valign="top">
 <span class="lightLink top-anime-rank-text rank1">1</span>
 </td>,
 <td class="rank ac" valign="top">
 <span class="lightLink top-anime-rank-text rank1">2</span>
 </td>,
 <td class="rank ac" valign="top">
 <span class="lightLink top-anime-rank-text rank1">3</span>
 </td>,
 <td class="rank ac" valign="top">
 <span class="lightLink top-anime-rank-text rank1">4</span>
 </td>,
 <td class="rank ac" valign="top">
 <span class="lightLink top-anime-rank-text rank1">5</span>
 </td>]

Let's define a function `get_ranks` that returns a list of ranks.

In [17]:
def get_ranks(doc):
    rank_tags=doc.find_all('td',class_='rank ac')
    return [tag.text.strip() for tag in rank_tags]
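
One detail worth knowing: when `class_` is given a string with several class names, Beautiful Soup compares it against the exact value of the `class` attribute. The sketch below, on made-up markup, contrasts this with a CSS selector, which matches the classes in any order. The sample table is an assumption and not the real page structure.

```python
from bs4 import BeautifulSoup

# Made-up rows showing three variations of the class attribute
sample_html = """
<table>
  <tr><td class="rank ac"><span>1</span></td></tr>
  <tr><td class="ac rank"><span>2</span></td></tr>
  <tr><td class="rank ac extra"><span>3</span></td></tr>
</table>
"""
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# class_ with a multi-word string matches the exact attribute value only
exact_matches = sample_doc.find_all('td', class_='rank ac')
exact_ranks = [tag.text.strip() for tag in exact_matches]

# A CSS selector matches any td that has both classes, in any order
css_matches = sample_doc.select('td.rank.ac')
css_ranks = [tag.text.strip() for tag in css_matches]

print(exact_ranks)
print(css_ranks)
```

If the site reorders or adds classes, the selector form is likely to keep working where the exact-string form would return an empty list.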

### _Extracting anime name tags and returning a list_

In [18]:
#Extracting the h3 from the anime_doc_demo
anime_name_tags_demo=anime_doc_demo.find_all('h3',class_='hoverinfo_trigger fl-l fs14 fw-b anime_ranking_h3')

In [19]:
#getting the value of anime_name_tags using index
anime_name_tags_demo[2].text

'Bleach: Sennen Kessen-hen'

Let's define a function `get_anime_names` that returns a list of anime names.

In [20]:
def get_anime_names(doc):
    anime_name_tags=doc.find_all('h3',class_="hoverinfo_trigger fl-l fs14 fw-b anime_ranking_h3")
    return [tag.text for tag in anime_name_tags]

### _Extracting anime link tags and returning a list_

In [21]:
#Extracting the h3 tags that contain the a (link) tags from anime_doc_demo
anime_link_tags_demo=anime_doc_demo.find_all('h3',class_="hoverinfo_trigger fl-l fs14 fw-b anime_ranking_h3")

In [22]:
#getting the value of anime_link_tags using index
anime_link_tags_demo[2].a['href']

'https://myanimelist.net/anime/41467/Bleach__Sennen_Kessen-hen'

Let's define a function `get_anime_links` that returns a list of anime links.

In [23]:
#function to get the values from the anime_link_tags
def get_anime_links (doc):
    anime_link_tags=doc.find_all('h3',class_="hoverinfo_trigger fl-l fs14 fw-b anime_ranking_h3")
    return [tag.a['href'] for tag in anime_link_tags]

### _Extracting rating  tags and returning a list_

In [24]:
#Extracting the td tags that contain the rating span tags from anime_doc_demo into a list rating_tags_demo
rating_tags_demo=anime_doc_demo.find_all('td',class_="score ac fs14")
rating_tags_demo[:5]

[<td class="score ac fs14"><div class="js-top-ranking-score-col di-ib al"><i class="icon-score-star fa-solid fa-star mr4 on"></i><span class="text on score-label score-9">9.15</span></div>
 </td>,
 <td class="score ac fs14"><div class="js-top-ranking-score-col di-ib al"><i class="icon-score-star fa-solid fa-star mr4 on"></i><span class="text on score-label score-9">9.11</span></div>
 </td>,
 <td class="score ac fs14"><div class="js-top-ranking-score-col di-ib al"><i class="icon-score-star fa-solid fa-star mr4 on"></i><span class="text on score-label score-9">9.09</span></div>
 </td>,
 <td class="score ac fs14"><div class="js-top-ranking-score-col di-ib al"><i class="icon-score-star fa-solid fa-star mr4 on"></i><span class="text on score-label score-9">9.08</span></div>
 </td>,
 <td class="score ac fs14"><div class="js-top-ranking-score-col di-ib al"><i class="icon-score-star fa-solid fa-star mr4 on"></i><span class="text on score-label score-9">9.07</span></div>
 </td>]

In [25]:
#getting the value of rating_tags using index
rating_tags_demo[0].span.text

'9.15'

Let's define a function `get_ratings` that returns a list of ratings.

In [26]:
def get_ratings(doc) :
    ratings_tag=doc.find_all('td',class_="score ac fs14")
    return [tag.span.text for tag in ratings_tag]
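
The functions above return every value as a string. If numeric values are needed later, a conversion step along the following lines could be added. The helper name `to_number` and the placeholder values `'N/A'` and `'-'` are assumptions for illustration.

```python
def to_number(text, kind=float):
    """Convert scraped text to a number, returning None if it is not numeric."""
    try:
        return kind(text.strip())
    except ValueError:
        return None

# Sample values in the same string form that the scraper returns
sample_ratings = ['9.15', ' 8.70 ', 'N/A']
sample_ranks = ['1', '52', '-']

ratings_as_floats = [to_number(r) for r in sample_ratings]
ranks_as_ints = [to_number(r, int) for r in sample_ranks]

print(ratings_as_floats)
print(ranks_as_ints)
```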

Next we define a function `get_top_anime` that uses the functions above to get the information from a given page. It returns a dictionary `anime_dict` with each rank as a key and, for each key, a dictionary containing `Rank`, `Anime_name`, `Anime_link` and `Ratings` as its value.

In [27]:
def get_top_anime(page_number):
    base_url="https://myanimelist.net/topanime.php"
    # each page lists 50 entries, so page 2 starts at limit=50
    n=(page_number-1)*50
    url=base_url+"?limit={}".format(n)

    anime_doc=get_anime_doc(url)
    ranks_list=get_ranks(anime_doc)
    anime_names_list=get_anime_names(anime_doc)
    anime_links_list=get_anime_links(anime_doc)
    ratings_list=get_ratings(anime_doc)
    anime_dict={}
    for rank,name,link,rating in zip(ranks_list,anime_names_list,anime_links_list,ratings_list):
        anime_dict[rank]={
            "Rank":rank,
            "Anime_name": name,
            "Anime_link":link,
            "Ratings":rating
        }

    return anime_dict
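
The page URL is built from an offset of 50 entries per page. As a small sketch of that arithmetic on its own, the hypothetical helper `build_page_url` below uses the standard `urllib.parse.urlencode` function; the constant names are assumptions for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://myanimelist.net/topanime.php"
ITEMS_PER_PAGE = 50  # the page size assumed by get_top_anime

def build_page_url(page_number):
    """Return the URL for a 1-based page number (hypothetical helper)."""
    if page_number < 1:
        raise ValueError("page_number must be 1 or greater")
    offset = (page_number - 1) * ITEMS_PER_PAGE
    return BASE_URL + "?" + urlencode({"limit": offset})

page_urls = [build_page_url(n) for n in range(1, 4)]
for url in page_urls:
    print(url)
```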


## _Functions_

We can now put together everything we've done so far. The function `get_top_anime` parses the information on one page, taking `page_number` as its argument, and calling it for several pages gives an output with more than 100 rows and more than 3 columns. The [zip](https://www.w3schools.com/python/ref_func_zip.asp) function is used to combine the data belonging to each rank, creating a dictionary of dictionaries keyed by rank.

In [28]:
def get_top_anime_doc(url):
    response=requests.get(url)
    # stop early if the page could not be downloaded
    if response.status_code != 200:
        print("Status code:",response.status_code)
        raise Exception("Failed to fetch the webpage",url)
    anime_doc=BeautifulSoup(response.text,'html.parser')
    return anime_doc

def get_ratings(doc) :
    ratings_tag=doc.find_all('td',class_="score ac fs14")
    return [tag.span.text for tag in ratings_tag]


def get_anime_names(doc):
    anime_name_tags=doc.find_all('h3',class_="hoverinfo_trigger fl-l fs14 fw-b anime_ranking_h3")
    return [tag.text for tag in anime_name_tags]


def get_anime_links(doc):
    anime_link_tags=doc.find_all('h3',class_="hoverinfo_trigger fl-l fs14 fw-b anime_ranking_h3")
    return [tag.a['href'] for tag in anime_link_tags]


def get_ranks(doc):
    rank_tags=doc.find_all('td',class_='rank ac')
    return [tag.text.strip() for tag in rank_tags]


def get_top_anime (page_number):
    base_url="https://myanimelist.net/topanime.php"
    n=(page_number-1)*50
    url=base_url+"?limit={}".format(n)
    
    anime_doc=get_top_anime_doc(url)
    ranks_list=get_ranks(anime_doc)
    anime_names_list=get_anime_names(anime_doc)
    anime_links_list=get_anime_links(anime_doc)
    ratings_list=get_ratings(anime_doc)
    anime_dict={}
    for rank,name,link,rating in zip(ranks_list,anime_names_list,anime_links_list,ratings_list):
        anime_dict[rank]={
            "Rank":rank,
            "Anime_name": name,
            "Anime_link":link,
            "Ratings":rating
        }

    return anime_dict
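
One property of `zip` is worth keeping in mind: it stops at the shortest input, so if one list comes back shorter than the others, rows are dropped silently. The sketch below shows this on made-up sample lists and adds a simple length check. All `sample_` names and values are assumptions for illustration.

```python
sample_ranks = ['1', '2', '3']
sample_names = ['Anime A', 'Anime B', 'Anime C']
sample_links = ['https://example.com/a', 'https://example.com/b', 'https://example.com/c']
sample_ratings = ['9.15', '9.11']  # one value short on purpose

sample_lists = (sample_ranks, sample_names, sample_links, sample_ratings)

# zip pairs items by position and stops at the shortest list
sample_dict = {}
for rank, name, link, rating in zip(*sample_lists):
    sample_dict[rank] = {
        "Rank": rank,
        "Anime_name": name,
        "Anime_link": link,
        "Ratings": rating,
    }

# A length check makes a mismatch visible instead of silent
lists_match = len({len(lst) for lst in sample_lists}) == 1

print(list(sample_dict))
print(lists_match)
```

On Python 3.10 or later, `zip(..., strict=True)` raises an error on a length mismatch instead.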

Using the reusable functions defined in this section, we can obtain a dictionary containing 200 rows and 4 columns of data; the number of pages can be increased or decreased by changing the range in the for loop. The following is the start of the data returned for page number 2:

In [29]:
get_top_anime(2)

{'51': {'Rank': '51',
  'Anime_name': 'Odd Taxi',
  'Anime_link': 'https://myanimelist.net/anime/46102/Odd_Taxi',
  'Ratings': '8.70'},
 '52': {'Rank': '52',
  'Anime_name': 'Vinland Saga Season 2',
  'Anime_link': 'https://myanimelist.net/anime/49387/Vinland_Saga_Season_2',
  'Ratings': '8.70'},
 '53': {'Rank': '53',
  'Anime_name': 'Code Geass: Hangyaku no Lelouch',
  'Anime_link': 'https://myanimelist.net/anime/1575/Code_Geass__Hangyaku_no_Lelouch',
  'Ratings': '8.70'},
 '54': {'Rank': '54',
  'Anime_name': "Fate/stay night Movie: Heaven's Feel - III. Spring Song",
  'Anime_link': 'https://myanimelist.net/anime/33050/Fate_stay_night_Movie__Heavens_Feel_-_III_Spring_Song',
  'Ratings': '8.69'},
 '55': {'Rank': '55',
  'Anime_name': 'Great Teacher Onizuka',
  'Anime_link': 'https://myanimelist.net/anime/245/Great_Teacher_Onizuka',
  'Ratings': '8.69'},
 '56': {'Rank': '56',
  'Anime_name': 'One Piece',
  'Anime_link': 'https://myanimelist.net/anime/21/One_Piece',
  'Ratings': '8.69'}

The `get_top_anime` function returns a dictionary containing the information parsed from the given page number. We merge the dictionaries for several pages using a for loop to get a dictionary `top_animes_dict` containing 200 rows and 4 columns. See [dict](https://datagy.io/python-merge-dictionaries/) for 7 different ways in which 2 dictionaries can be merged.

In [30]:
top_animes_dict={}
for i in range (1,5):
    anime=get_top_anime(i)
    # the | operator merges two dictionaries (Python 3.9 or later)
    top_animes_dict= top_animes_dict|anime

In [31]:
len (top_animes_dict)

200
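
For Python versions older than 3.9, where the `|` operator is not available for dictionaries, the same merge can be sketched with `update` or dictionary unpacking. The two small page dictionaries below are made-up sample data.

```python
# Made-up results standing in for two scraped pages
page_one = {'1': {'Rank': '1', 'Anime_name': 'Anime A'}}
page_two = {'51': {'Rank': '51', 'Anime_name': 'Anime B'}}

# update() adds the entries of each page to one growing dictionary
merged = {}
for page in (page_one, page_two):
    merged.update(page)

# Unpacking builds a new dictionary without modifying the inputs
merged_copy = {**page_one, **page_two}

print(len(merged))
print(merged == merged_copy)
```

Because the keys are ranks, two pages should never share a key; if they did, the later page's entry would overwrite the earlier one.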

##  _Creating a dataframe_


We now convert the values of the dictionary built above into a dataframe using [Pandas](https://pandas.pydata.org/docs/). Each value in `top_animes_dict` becomes one row containing the information about one of the top anime.

In [32]:
!pip install pandas --upgrade --quiet

In [33]:
# importing pandas library
import pandas as pd

In [34]:
top_animes_df=pd.DataFrame(top_animes_dict.values())

In [35]:
top_animes_df.head()

| | Rank | Anime_name | Anime_link | Ratings |
|---|---|---|---|---|
| 0 | 1 | Shingeki no Kyojin: The Final Season - Kankets... | https://myanimelist.net/anime/51535/Shingeki_n... | 9.15 |
| 1 | 2 | Fullmetal Alchemist: Brotherhood | https://myanimelist.net/anime/5114/Fullmetal_A... | 9.11 |
| 2 | 3 | Bleach: Sennen Kessen-hen | https://myanimelist.net/anime/41467/Bleach__Se... | 9.09 |
| 3 | 4 | Steins;Gate | https://myanimelist.net/anime/9253/Steins_Gate | 9.08 |
| 4 | 5 | Gintama° | https://myanimelist.net/anime/28977/Gintama° | 9.07 |


The `head` method displays the first 5 rows of the dataframe.
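
Since the scraped values are all strings, the dataframe columns hold text rather than numbers. A minimal sketch of converting them, using a small made-up dictionary in the same shape as `top_animes_dict`, might look like this (the sample names and links are assumptions):

```python
import pandas as pd

# Made-up data in the same shape as top_animes_dict
sample_dict = {
    '1': {'Rank': '1', 'Anime_name': 'Anime A',
          'Anime_link': 'https://example.com/a', 'Ratings': '9.15'},
    '2': {'Rank': '2', 'Anime_name': 'Anime B',
          'Anime_link': 'https://example.com/b', 'Ratings': '9.11'},
}

sample_df = pd.DataFrame(list(sample_dict.values()))

# Convert the text columns to numeric types
sample_df['Rank'] = sample_df['Rank'].astype(int)
# errors='coerce' turns values that are not numbers into NaN
sample_df['Ratings'] = pd.to_numeric(sample_df['Ratings'], errors='coerce')

print(sample_df.dtypes)
print(sample_df['Ratings'].mean())
```

With numeric columns, operations such as sorting by rating or computing averages behave as expected.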

## _Converting into a CSV file_

Let's write the `top_animes_df` dataframe into a CSV file.

In [36]:
top_animes_df.to_csv("top_animes_list.csv",index=False)

`top_animes_list.csv` contains the information about the top anime up to rank 200. The first 10 lines of the CSV file (the header and 9 rows) are printed below using the `head` shell command, to show the structure of the information within the file.

In [37]:
!head top_animes_list.csv

Rank,Anime_name,Anime_link,Ratings
1,Shingeki no Kyojin: The Final Season - Kanketsu-hen,https://myanimelist.net/anime/51535/Shingeki_no_Kyojin__The_Final_Season_-_Kanketsu-hen,9.15
2,Fullmetal Alchemist: Brotherhood,https://myanimelist.net/anime/5114/Fullmetal_Alchemist__Brotherhood,9.11
3,Bleach: Sennen Kessen-hen,https://myanimelist.net/anime/41467/Bleach__Sennen_Kessen-hen,9.09
4,Steins;Gate,https://myanimelist.net/anime/9253/Steins_Gate,9.08
5,Gintama°,https://myanimelist.net/anime/28977/Gintama°,9.07
6,Kaguya-sama wa Kokurasetai: Ultra Romantic,https://myanimelist.net/anime/43608/Kaguya-sama_wa_Kokurasetai__Ultra_Romantic,9.06
7,Shingeki no Kyojin Season 3 Part 2,https://myanimelist.net/anime/38524/Shingeki_no_Kyojin_Season_3_Part_2,9.06
8,Gintama: The Final,https://myanimelist.net/anime/39486/Gintama__The_Final,9.05
9,Gintama',https://myanimelist.net/anime/9969/Gintama,9.04
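
Anime names can contain commas or quote characters, which CSV writers handle by quoting the field. The sketch below illustrates this with Python's standard `csv` module and an in-memory buffer instead of a file. The sample rows are made up, and pandas applies comparable minimal quoting by default when writing CSV files.

```python
import csv
import io

fieldnames = ['Rank', 'Anime_name', 'Anime_link', 'Ratings']

# Made-up rows, including a name with a comma and one with an apostrophe
sample_rows = [
    {'Rank': '1', 'Anime_name': 'Anime, The Movie',
     'Anime_link': 'https://example.com/a', 'Ratings': '9.15'},
    {'Rank': '2', 'Anime_name': "Anime B'",
     'Anime_link': 'https://example.com/b', 'Ratings': '9.11'},
]

# Write the rows to an in-memory text buffer
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(sample_rows)
csv_text = buffer.getvalue()

# Read the rows back to confirm nothing was lost
buffer.seek(0)
rows_read = list(csv.DictReader(buffer))

print(csv_text)
print(rows_read == sample_rows)
```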


## _Summary_


In this project we have done the following tasks:
1) Downloading the webpage using Requests.
2) Extracting information from the HTML source code of a webpage programmatically, using the BeautifulSoup library.
3) Creating a BeautifulSoup object to parse the content within the source code.
4) Compiling the extracted information into Python lists and dictionaries.
5) Creating a dataframe of the webpage data using pandas.
6) Converting the pandas dataframe to a CSV file.


The CSV file contains data in the following format:
`Rank,Anime_name,Anime_link,Ratings
1,Shingeki no Kyojin: The Final Season - Kanketsu-hen,https://myanimelist.net/anime/51535/Shingeki_no_Kyojin__The_Final_Season_-_Kanketsu-hen,9.15
2,Fullmetal Alchemist: Brotherhood,https://myanimelist.net/anime/5114/Fullmetal_Alchemist__Brotherhood,9.11`

## _Future Works_



We can scrape the web page to obtain information on the other categories, such as Top Airing, Top Upcoming, Top TV Series, Top Movies, Top OVAs (Original Video Animation), Top ONAs (Original Net Animation), Top Specials, Most Popular and Most Favorited. The project can also be extended to scrape the pages of individual anime. The information that can be collected for each anime includes Details, Characters & Staff, Episodes, Videos, Stats, Reviews, Recommendations, Interest Stacks, News, Forum, Clubs and Pictures.
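
As a starting point for the category pages, the URL could be built with an extra query parameter. This is only a sketch: the `type` parameter and its values below are assumptions that are not taken from the project above and should be checked against the links on the live page. The helper name `build_category_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://myanimelist.net/topanime.php"

# Assumed mapping from category name to the `type` query value.
# These values are not confirmed by the project above and should be
# verified against the links on the live page before use.
CATEGORY_TYPES = {
    "Top Airing": "airing",
    "Top Upcoming": "upcoming",
    "Top Movies": "movie",
}

def build_category_url(category, page_number=1):
    """Return a category URL for a 1-based page number (hypothetical helper)."""
    query = {
        "type": CATEGORY_TYPES[category],
        "limit": (page_number - 1) * 50,
    }
    return BASE_URL + "?" + urlencode(query)

print(build_category_url("Top Airing", 2))
```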

The data collected can be used to analyse which genres attract the most people and how the number of viewers relates to the number of episodes. Viewer reviews and the number of likes given by users could indicate the age group or gender with which a particular anime is popular. The data could also show whether users prefer anime in the form of movies or series.