# <p style="text-align: center;"> <b> Data Collection </b></p>
---

<div class="list-group" id="list-tab" role="tablist">
    <h3 style="text-align: left; background-color: #EDC0C7; font-family:newtimeroman; color: black; padding: 14px; line-height: 1; border-radius:10px"><b>Table of Contents ✍️</b></h3>
    
- [1. Introduction](#introduction)
    - [1.1 Requirements](#requirements)
    - [1.2 Data Sources](#data_sources)
    - [1.3 Objectives](#objectives)
    - [1.4 Methodology](#methodology)
- [2. Implementation](#implementation)
    - [Import libraries](#import_libraries)
    - [Collect urls of the top 10000 manga](#collect_urls)
    - [Collect data of each manga](#collect_data)
    - [Save the data](#save_data)
    

<a class="anchor" id="introduction"></a>
## <div style="text-align: left; background-color:#EDC0C7; font-family:newtimeroman;color: black; padding: 14px; line-height: 1;border-radius:10px">1. Introduction</div>


<a class="anchor" id="requirements"></a>

## <span style='color:#2B9C15 '> 1.1 Requirements  </span>
- Our group will be responsible for collecting and preparing the data for analysis and exploration.
- The data will be collected from the Internet by parsing HTML code or using APIs. Our group is not allowed to use existing datasets.
- Our dataset must have at least 5 fields and 1000 observations.

<a class="anchor" id="data_sources"></a>

## <span style='color:#2B9C15 '> 1.2 Data Sources  </span>
- **Topic**: Manga (Japanese comics)
- **Website**: [MyAnimeList](https://myanimelist.net/)

<a class="anchor" id="objectives"></a>

## <span style='color:#2B9C15 '> 1.3 Objectives  </span>
- Crawl data of the top 10000 manga on MyAnimeList.

- Data attributes:
    - Title: title of the manga.
    - Score: average score of the manga.
    - Vote: number of votes for the manga.
    - Ranked: rank of the manga.
    - Popularity: popularity of the manga.
    - Members: number of members who have added this manga to their list.
    - Favorite: number of members who have marked this manga as a favorite.
    - Volumes: number of volumes of the manga.
    - Chapters: number of chapters of the manga.
    - Status: status of the manga (e.g. Publishing, Finished, On Hiatus).
    - Published: date of publication of the manga.
    - Genres: genres of the manga.
    - Themes: themes of the manga.
    - Author: author of the manga.
    - Total Review: total number of reviews of the manga.
    - Type Review: number of reviews of each type (recommended, mixed feelings, not recommended).
    
- Save the data in a CSV file.

<a class="anchor" id="methodology"></a>

## <span style='color:#2B9C15 '> 1.4 Methodology  </span>
1. Start at the [Top Manga page](https://myanimelist.net/topmanga.php?limit=0) and crawl the urls of the top 10000 manga by changing the limit parameter.

2. Iterate through the urls and crawl the html code of each manga.

3. Parse the html code of each manga to get the data.

4. Save the data in a CSV file.
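
Step 1 can be illustrated with a minimal sketch that builds the ranking-page URLs. It assumes each ranking page lists 50 entries, which is the step size used by the crawl loops in the Implementation section; the name `ranking_page_urls` is hypothetical and not part of the original code.

```python
BASE_URL = "https://myanimelist.net/topmanga.php"  # Top Manga page from step 1


def ranking_page_urls(total=10000, page_size=50):
    """Return one ranking-page URL per block of `page_size` entries.

    `page_size=50` is an assumption taken from the step used in the crawl loops.
    """
    return [f"{BASE_URL}?limit={offset}" for offset in range(0, total, page_size)]


pages = ranking_page_urls()
print(len(pages))   # 200
print(pages[0])     # https://myanimelist.net/topmanga.php?limit=0
print(pages[-1])    # https://myanimelist.net/topmanga.php?limit=9950
```

The last offset is 9950 rather than 10000 because the page at `limit=9950` already covers ranks 9951 to 10000.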

<a class="anchor" id="implementation"></a>
## <div style="text-align: left; background-color:#EDC0C7; font-family:newtimeroman;color: black; padding: 14px; line-height: 1;border-radius:10px">2. Implementation</div>

<a class="anchor" id="import_libraries"></a>

## <span style='color:#2B9C15 '> 📕 Import libraries </span>
👉 The following libraries are used in this notebook:
- `requests`: Used to make HTTP requests to the website and get the HTML content

- `HTMLSession` (from `requests_html`): Used to get the HTML content of the dynamic website (the website uses JavaScript to render the content).

- `BeautifulSoup`: Used for parsing HTML code, extracting data from HTML code.

- `re`: Used for pattern matching and for extracting data from HTML content with regular expressions.

- `nest_asyncio`: Enables the use of asynchronous code in Jupyter notebooks, addressing compatibility issues with asyncio.

- `pandas`: Used to organize and manipulate the extracted data in a DataFrame, a tabular data structure.

- `datetime`: Used to timestamp the data collection process, providing information about when the data was collected.

- `time`: Used to pause the execution of the code for a specified amount of time.

In [1]:
import requests
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import re
import nest_asyncio
import pandas as pd 
import datetime
import time

<a class="anchor" id="collect_urls"></a>

## <span style='color:#2B9C15 '> 📕 Collect urls of the top 10000 manga </span>
👉 Steps to collect the urls:

1. Send a GET request to the [Top Manga page](https://myanimelist.net/topmanga.php?limit=0) to get the HTML content of the page.

2. Parse the HTML content using `BeautifulSoup`.

3. Find and extract the urls from the HTML content.

4. Save the urls in a list.

5. Repeat the above steps, increasing the limit parameter by 50 each time (each page lists 50 manga), until 10000 urls have been collected.

The collecting process is split into 2 parts, each collecting 5000 urls, to reduce the risk of the website interrupting the connection because of too many requests.

- Applying `nest_asyncio` to avoid issues that may arise from nested event loops when working with asynchronous code.

- Creating an HTML Session using the `HTMLSession` class from the `requests_html` library.

In [2]:
nest_asyncio.apply() 
session = HTMLSession()

### 👉 Crawl the first 5000 urls

In [3]:
listUrl1 = []

for i in range(0, 5000, 50):
    # Url of the ranking page to scrape
    url = f'https://myanimelist.net/topmanga.php?limit={i}'

    # Get the html content
    html = requests.get(url).text

    # Parse the html content
    soup = BeautifulSoup(html, "html.parser")

    # Get the list of manga
    listItem = soup.find_all("td", {"class": "title al va-t clearfix word-break"})

    # Get the url of each manga
    for item in listItem:
        listUrl1.append(item.find('a').get('href'))

    # Print the number of manga urls collected
    print(f'{len(listUrl1)} urls collected', end='\r', flush=True)


5000 urls collected

### 👉 Crawl the remaining 5000 urls

- Same procedure as above, with the limit parameter running from 5000 to 9950

In [4]:
listUrl2 = []

for i in range(5000,10000,50):
    # Url of the ranking page to scrape
    url = f'https://myanimelist.net/topmanga.php?limit={i}'

    # Get the html content
    html = requests.get(url).text

    # Parse the html content
    soup = BeautifulSoup(html, "html.parser")

    # Get the list of manga
    listItem = soup.find_all("td", {"class": "title al va-t clearfix word-break"})

    # Get the url of each manga
    for item in listItem:
        listUrl2.append(item.find('a').get('href'))

    # Print the number of manga urls collected
    print(f'{len(listUrl2)} urls collected', end='\r', flush=True)

5000 urls collected

### 👉 Concatenate the 2 url lists

In [5]:
listUrl = listUrl1 + listUrl2
print(f'Total: {len(listUrl)} urls collected')

Total: 10000 urls collected
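
The count alone does not show whether all collected urls are distinct. The following is a minimal sketch of an order-preserving duplicate check; the helper name `dedupe_keep_order` and the sample urls are hypothetical, and whether duplicates actually occur in `listUrl` is not something the original notebook reports.

```python
def dedupe_keep_order(urls):
    """Drop repeated urls, keeping each url at the position of its first occurrence."""
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(urls))


# Hypothetical sample; in the notebook the input would be listUrl
sample = [
    "https://example.org/manga/1",
    "https://example.org/manga/2",
    "https://example.org/manga/1",
]
unique = dedupe_keep_order(sample)
print(f"{len(sample) - len(unique)} duplicate(s) removed")  # 1 duplicate(s) removed
```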


<a class="anchor" id="collect_data"></a>

## <span style='color:#2B9C15 '> 📕 Collect data of each manga  </span>
1. From each url collected above, send a GET request to get the HTML content of the page.
2. If the length of the HTML content is smaller than 4000 characters, sleep for 10 minutes and send the GET request again. A response that short means the website has blocked the connection, so we need to wait for a while before retrying.
3. Save the HTML content in a list for parsing later.

This process is again split into 2 parts, each collecting 5000 HTML contents, to avoid the connection being interrupted by the website due to too many requests.
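
The retry rule in step 2 can be sketched as a small helper. This is an illustration under stated assumptions, not the notebook's own code: the name `fetch_with_retry`, the injectable `fetch` and `sleep` callables, and the `max_attempts` cap are all hypothetical additions. The cap is there so that a page that is always short cannot block the crawl forever. The default wait of 600 seconds matches the `time.sleep(600)` call used in the cells of this section.

```python
import time

MIN_HTML_LENGTH = 4000  # threshold from step 2: shorter responses are treated as blocked


def fetch_with_retry(url, fetch, wait_seconds=600, max_attempts=5, sleep=time.sleep):
    """Call fetch(url) until the body is long enough or the attempts run out.

    `fetch` is any callable returning the page text, for example
    lambda u: session.get(u).text. Returns None if every attempt was too short.
    """
    for attempt in range(1, max_attempts + 1):
        text = fetch(url)
        if len(text) >= MIN_HTML_LENGTH:
            return text
        if attempt < max_attempts:
            sleep(wait_seconds)  # wait before the next attempt
    return None


# Demonstration with a fake fetcher: the first response is short, the second is long enough
responses = iter(["blocked", "x" * 5000])
waits = []
page = fetch_with_retry(
    "https://example.org/manga/1",      # placeholder url
    fetch=lambda u: next(responses),
    sleep=waits.append,                 # record the wait instead of sleeping
)
print(len(page), waits)  # 5000 [600]
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; in the notebook the default `time.sleep` would be used.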

### 👉 Crawl HTML content from the first 5000 manga URLs

In [9]:
listHtml1 = []

for url in listUrl[:5000]:
    res = session.get(url)
    while len(res.text) < 4000:
        # Sleep for 10 minutes
        time.sleep(600)
        res = session.get(url)
        
    listHtml1.append(res.text)

    # Print the number of manga html collected
    print(f'{len(listHtml1)}/{len(listUrl)} manga html collected', end='\r', flush=True)

5000/10000 manga html collected

### 👉 Crawl HTML content from the remaining 5000 manga URLs
- Same procedure as above, applied to `listUrl[5000:]`

In [10]:
listHtml2 = []

for url in listUrl[5000:]:
    res = session.get(url)
    while len(res.text) < 4000:
        # Sleep for 10 minutes
        time.sleep(600)
        res = session.get(url)
        
    listHtml2.append(res.text)

    # Print the number of manga html collected
    print(f'{len(listHtml2)+5000}/{len(listUrl)} manga html collected', end='\r', flush=True)

10000/10000 manga html collected

In [11]:
# Extract time of data collection to report for the project
now = datetime.datetime.now()
now = now.strftime("%Y-%m-%d")
print("Time of data collection: ", now)

Time of data collection:  2023-12-05


### 👉 Concatenate the 2 HTML lists

In [12]:
listHtml = listHtml1 + listHtml2
print(f'Total: {len(listHtml)} manga html collected')

Total: 10000 manga html collected


### 👉 Extracting the detailed values of each comic website page

1. Parsing HTML Content: The function starts by using BeautifulSoup to parse the HTML content of a comic page

2. Extracting Title:
    - The title of the comic is extracted using `soup.find('span', {'itemprop': 'name'})`
    - If the title is not found (i.e., None), the function returns None to indicate that the information couldn't be extracted

3. Handling English Title (title-english):
    - If an English title is present (indicated by the presence of a title-english span), it is extracted and removed from the main title. The resulting title is a combination of the original title and the English title enclosed in parentheses

4. Extracting Rating Information:
    - The rating information is extracted using `soup.find('span', {'itemprop': 'ratingValue'}).text` and `soup.find('span', {'itemprop': 'ratingCount'}).text`
    - These represent the rating value and the count of ratings, respectively

5. Extracting Rank and Popularity:
    - The rank and popularity are extracted using regular expressions (`re.findall`)
    - The regular expression `r'\d+'` is used to find all sequences of digits in the text, and `[0]` is used to select the first match

6. Looping Through Information Sections:
    - The function iterates through the information sections of the manga page, represented by `div` elements with the class `spaceit_pad`

7. Extracting Manga Details:
    - For each section, it checks the content and extracts relevant details such as `volumes`, `chapters`, `status`, `published date`, `genres`, `themes`, `authors`, `favorites`, and `members`
    - The extracted information is stored in the respective variables

8. Extracting Review Information:
    - The function then moves to the `'manga-info-review__header'` section to extract information related to reviews
    - It retrieves the `total number` of reviews and the number of reviews for each type (`recommended`, `mixed-feelings`, `not-recommended`)

9. Returning a Dictionary:  
    - The function compiles all the extracted information into a dictionary and returns it; the dictionaries of all pages are then collected into a pandas DataFrame
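
The digit extraction in step 5 has two properties worth seeing on concrete input: `re.findall(r'\d+', text)[0]` stops at the first non-digit character, and it raises `IndexError` when the text contains no digits. The sketch below shows both on hypothetical sample strings (not taken from real pages) and offers one possible alternative, a hypothetical helper `first_int` that tolerates thousands separators.

```python
import re


def first_int(text):
    """Return the first integer in `text`, ignoring thousands separators.

    Returns None when the text contains no digits.
    """
    match = re.search(r'\d[\d,]*', text)
    if match is None:
        return None
    return int(match.group().replace(',', ''))


# Hypothetical sample strings
print(re.findall(r'\d+', 'Ranked #2')[0])       # 2
print(re.findall(r'\d+', '1,234 reviews')[0])   # 1  (stops at the comma)
print(first_int('1,234 reviews'))               # 1234
print(first_int('N/A'))                         # None
```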




In [13]:
def extract_info(htmlComic):
    soup = BeautifulSoup(htmlComic, "html.parser")

    title = soup.find('span', {'itemprop': 'name'})
    if title is None:
        return None
    else:
        title_text = title.text.strip()
        title_english_span = title.find('span', {'class': 'title-english'})

        if title_english_span is not None:
            title_english_text = title_english_span.text.strip()
            title_text = title_text.replace(title_english_text, '')
            title = f'{title_text} ({title_english_text})'
        else:
            title = title_text
    ratingValue = soup.find('span', {'itemprop': 'ratingValue'}).text
    ratingCount = soup.find('span', {'itemprop': 'ratingCount'}).text
    ranked = re.findall(r'\d+', soup.find('span', {'class': 'numbers ranked'}).text)[0]
    popularity = re.findall(r'\d+', soup.find('span', {'class': 'numbers popularity'}).text)[0]

    volumes, chapters, status, published = '', '', '', ''
    genres, themes, authors, favorites, members = [], [], '', '', ''

    for space in soup.find_all("div", {'class': 'spaceit_pad'}):
        text = space.text
        if 'Volumes' in text:
            volumes = text.split(':')[1].strip()
        elif 'Chapters' in text:
            chapters = text.split(':')[1].strip()
        elif 'Status' in text:
            status = text.split(':')[1].strip()
        elif 'Published' in text:
            published = text.split(':')[1].strip()
        elif 'Genres' in text:
            genres = [gen.text for gen in space.find_all('span', {'itemprop': 'genre'})]
        elif 'Themes' in text:
            themes = [theme.text for theme in space.find_all('span', {'itemprop': 'genre'})]
        elif 'Authors' in text:
            authors = text.split(':')[1].strip()
        elif 'Favorites' in text:
            favorites = text.split(':')[1].strip()
        elif 'Members' in text:
            members = text.split(':')[1].strip()

    infoReviews = soup.find('div', {'class': 'manga-info-review__header mal-navbar'})
    totalReviews = re.findall(r'\d+', infoReviews.find('div', {'class': 'right'}).text)[0]

    typeReview = [
        int(re.findall(r'\d+', infoReviews.find('div', {'class': 'recommended'}).text)[0]),
        int(re.findall(r'\d+', infoReviews.find('div', {'class': 'mixed-feelings'}).text)[0]),
        int(re.findall(r'\d+', infoReviews.find('div', {'class': 'not-recommended'}).text)[0])
    ]

    return {
        "Title": title, "Score": ratingValue, "Vote": ratingCount,
        "Ranked": ranked, "Popularity": popularity, "Members": members,
        "Favorite": favorites, "Volumes": volumes, "Chapters": chapters,
        "Status": status, "Published": published, "Genres": genres,
        "Themes": themes, "Author": authors, "Total Review": totalReviews,
        "Type Review": typeReview
    }

# Parse each page once and skip pages where no title was found
data_list = [info for info in map(extract_info, listHtml) if info is not None]
df = pd.DataFrame(data_list)
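
`extract_info` reads most fields with `text.split(':')[1]`, which keeps only the part between the first and second colon. A minimal sketch of the difference, using a hypothetical helper `parse_field` and made-up sample strings: splitting at the first colon only preserves values that themselves contain a colon.

```python
def parse_field(text):
    """Split 'Label: value' at the first colon only and trim whitespace."""
    label, _, value = text.partition(':')
    return label.strip(), value.strip()


# Hypothetical sample strings
print(parse_field('Status: Finished'))               # ('Status', 'Finished')
print('Note: Vol. 1: Extra'.split(':')[1].strip())   # Vol. 1
print(parse_field('Note: Vol. 1: Extra'))            # ('Note', 'Vol. 1: Extra')
```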

### 👉 Display the first 5 rows to check the data

In [14]:
df.head()

| | Title | Score | Vote | Ranked | Popularity | Members | Favorite | Volumes | Chapters | Status | Published | Genres | Themes | Author | Total Review | Type Review |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Berserk | 9.47 | 331288 | 1 | 1 | 665300 | 122841 | Unknown | Unknown | Publishing | Aug 25, 1989 to ? | [Action, Adventure, Award Winning, Drama, Fant... | [Gore, Military, Mythology, Psychological] | Miura, Kentarou (Story & Art), Studio Gaga (Art) | 258 | [233, 15, 10] |
| 1 | JoJo no Kimyou na Bouken Part 7: Steel Ball Run | 9.3 | 156368 | 2 | 26 | 256146 | 42864 | 24 | 96 | Finished | Jan 19, 2004 to Apr 19, 2011 | [Action, Adventure, Mystery, Supernatural] | [] | Araki, Hirohiko (Story & Art) | 128 | [120, 7, 1] |
| 2 | Vagabond | 9.24 | 136403 | 3 | 15 | 364891 | 40158 | 37 | 327 | On Hiatus | Sep 3, 1998 to May 21, 2015 | [Action, Adventure, Award Winning] | [Historical, Samurai] | Inoue, Takehiko (Story & Art), Yoshikawa, Eiji... | 97 | [88, 8, 1] |
| 3 | One Piece | 9.22 | 366668 | 4 | 3 | 599278 | 114531 | Unknown | Unknown | Publishing | Jul 22, 1997 to ? | [Action, Adventure, Fantasy] | [] | Oda, Eiichiro (Story & Art) | 206 | [173, 17, 16] |
| 4 | Monster | 9.15 | 93945 | 5 | 29 | 236355 | 20501 | 18 | 162 | Finished | Dec 5, 1994 to Dec 20, 2001 | [Award Winning, Drama, Mystery] | [Adult Cast, Psychological] | Urasawa, Naoki (Story & Art) | 76 | [64, 7, 5] |


<a class="anchor" id="save_data"></a>
## <span style='color:#2B9C15 '> 📕 Save the data </span>

In [15]:
df.to_csv('../data/raw_comic.csv', index=False)
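
One consequence of saving to CSV is that list-valued columns such as Genres, Themes and Type Review are written as text, so they come back as strings when the file is read again. The sketch below illustrates the round trip on a hypothetical row. It deliberately uses the standard-library `csv` module instead of pandas to stay dependency-free; with pandas, the same `ast.literal_eval` call could be applied to a column after `pd.read_csv`.

```python
import ast
import csv
import io

row = {"Title": "Example", "Genres": ["Action", "Drama"]}  # hypothetical row

# Write the row to an in-memory CSV; the list is stored as its string form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back: the Genres cell is now a string, not a list
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(type(loaded["Genres"]).__name__)   # str

# Convert the string form back into a Python list
genres = ast.literal_eval(loaded["Genres"])
print(genres)                            # ['Action', 'Drama']
```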

<div style="text-align: left; background-color:#EDC0C7; font-family:Arial; color:black; padding: 12px; line-height:1.25;border-radius:1px; margin-bottom: 0em; text-align: center; font-size: 30px;border-style: solid;border-color: black;">END</div>