# DS 3000 HW 5 

Due: Sunday July 20th @ 11:59 PM EST

### Submission Instructions
Submit this `ipynb` file and a `PDF` file that includes the coding results to Gradescope (this can also be done via the assignment on Canvas). To ensure that your submitted files represent your latest code, run a fresh `Kernel > Restart & Run All` just before uploading the files to Gradescope.

**Note that this is a group assignment. Each group only needs to submit one copy; when you submit the work, please include everyone in your group.**

### Tips for success
- Start early
- Make use of Piazza
- Make use of office hours
- Remember to use cells and headings to make the notebook easy to read (if a grader cannot find the answer to a problem, you will receive no points for it)
- Under no circumstances may one student view or share their ungraded homework or quiz with another student [(see also)](http://www.northeastern.edu/osccr/academic-integrity), though you are welcome to **talk about** (not show each other) the problems.

## Project proposal

For this course, we aim to complete a data analysis project about the game [Palworld](https://en.wikipedia.org/wiki/Palworld). To help you get started, here are a few things to consider and work on in order to get clean data for later analysis.

To start the project, please take some time to get familiar with the game. You don't need to play it, but please at least learn the basic terminology, such as what a Pal is. (If you do play it, please do not spend too much time on it.)

The two recommended databases are [https://palworld.gg/](https://palworld.gg/) and [https://paldb.cc/en/](https://paldb.cc/en/). You can use either, both, or some other Palworld database.

### Part 1.1 (10 points)

Please list 2-3 questions you may be interested in studying with the Palworld database. They can be about anything in the game, such as Pals, items, or constructions. Some potential question structures are:
- Are `A` and `B` related? How are they related?
- Which features may affect changes in `C`?
- If I need a higher `D`, which features may have a lower/higher value?
- Based on `E` and `F`, which items/pals are similar?
- I need to predict the value of `G`; which features do I need to consider?

Do tankier pals have weaker attacks?

Based on rarity and price, what items are the most similar?

Do more expensive pals have higher stats?

### Part 1.2 (20 points)

Based on the questions proposed in Part 1.1, which features might we need to include in the analysis? Check the websites: which website has that information? **You need to pick at least 8 features for analysis.** We recommend a mix of numerical (numbers etc.) and categorical (level etc.) features. Are there any other features that you think may be important but are hard to extract or find on the website (they can be something in or not in the game)?

Rarity, Weight, Rank, Price, Defense, Shot Attack, HP, Melee Attack, Running Speed, and Sprinting Speed

### Part 1.3 (20 points)

Suppose you do have all the features you mentioned in Part 1.2. List 3-4 data visualizations you can make with those features. You do not need to make those visualizations here. Just describe the type of each visualization (histogram, scatter plot, etc.), which features are involved, whether any hover data or color will be added, and **discuss how these data visualizations may relate to (or even answer) your questions in Part 1.1**.

Combine Defense and HP into a single variable X, and combine Shot Attack and Melee Attack into a single variable Y. Visualize X against Y in a scatter plot and fit a linear regression to check for a correlation.

Use KNN classification to find which items are most similar based on rarity and price, taking rarity as x and price as y. Create a color-labeled scatter plot.

We can make a scatter plot that takes the price of pals as x and their stats as y.

### Part 1.4  (50 points)

Now, go ahead and try to scrape the features you need. 

Please show all the code you have for web scraping. Your current output data frame should include at least 4 features. (You do not need to scrape all features at this moment, although it is recommended to start early. You can also choose not to use the features you have scraped in the later analysis. No need to worry if you need to change anything later.) **Please design your code as a pipeline and clearly document each function.** See the Python Style Guide in Week 1 for proper documentation. It is also recommended to save the data you have scraped.

In [8]:
# Imports for scraping pals from palworld.gg

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
from bs4 import BeautifulSoup
import requests

In [9]:
def get_url_html(url):
    """
    Returns HTML from a url

    Args:
        url (str): a url

    Returns:
        html (str): HTML response from url
    """
    
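    # time out rather than hang, and fail loudly on HTTP error codes
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text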

In [14]:
def clean_pal_data(html):
    """
    Scrapes the first 10 pals listed on palworld.gg and creates a dataframe with names, HP, defense, melee attack, and shot attack statistics

    Args:
        html (str): html text object from palworld.gg

    Returns:
        all_pal_data (DataFrame): DataFrame with columns
                                Name (str): name of the pal
                                HP (int): health points of the pal
                                Defense (int): defense statistics of the pal
                                Melee Attack (int): melee attack statistics of the pal
                                Shot Attack (int): shot attack statistics of the pal
    """
    # create soup object with the html text object
    soup = BeautifulSoup(html, 'html.parser')

    # find all divs that have a class name of 'pal'
    all_pals = soup.find_all('div', class_='pal')

    # initialize feature lists
    pal_names = []
    pal_HP = []
    pal_defense = []
    pal_melee_attack = []
    pal_shot_attack = []

    # iterate through the first 10 pals to get each name and its stats
    for pal in all_pals[:10]:
        # skip divs without a link so a stale or undefined href is never reused
        link = pal.find('a')
        if link is None:
            continue
        href = link['href']

        # build pal url
        pal_url = 'https://palworld.gg' + href

        # get html from the pal url
        pal_html = get_url_html(pal_url)

        # create soup object with the html text object from the pal url
        pal_soup = BeautifulSoup(pal_html, 'html.parser')

        # find pal statistics by finding div with class name of 'stats'
        stats_section = pal_soup.find('div', class_='stats')
        all_stat_items = stats_section.find_all('div', class_='item')

        # collect this pal's stats; None marks a stat missing from the page
        stats = {'HP': None, 'Defense': None, 'Melee Attack': None, 'Shot Attack': None}
        for stat in all_stat_items:
            for label in stats:
                if label in stat.text:
                    # convert to int so the columns match the docstring
                    stats[label] = int(stat.find('div', class_='value').text.strip())
                    break

        # append exactly one value per list so all columns stay the same length
        pal_names.append(href.split('/')[-1])
        pal_HP.append(stats['HP'])
        pal_defense.append(stats['Defense'])
        pal_melee_attack.append(stats['Melee Attack'])
        pal_shot_attack.append(stats['Shot Attack'])

    # create DataFrame
    all_pal_data = pd.DataFrame({
        'Name': pal_names,
        'HP': pal_HP,
        'Defense': pal_defense,
        'Melee Attack': pal_melee_attack,
        'Shot Attack': pal_shot_attack
    })
    return all_pal_data

In [16]:
html = get_url_html('https://palworld.gg/pals')

clean_pal_data(html)

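| | Name | HP | Defense | Melee Attack | Shot Attack |
|---|---|---|---|---|---|
| 0 | anubis | 120 | 100 | 130 | 130 |
| 1 | arsox | 85 | 95 | 100 | 95 |
| 2 | astegon | 100 | 125 | 100 | 125 |
| 3 | azurmane | 130 | 110 | 100 | 120 |
| 4 | azurobe | 110 | 100 | 70 | 100 |
| 5 | azurobe-cryst | 115 | 105 | 100 | 105 |
| 6 | bastigor | 140 | 120 | 100 | 130 |
| 7 | beakon | 105 | 80 | 100 | 115 |
| 8 | beegarde | 80 | 90 | 100 | 90 |
| 9 | bellanoir | 120 | 100 | 100 | 150 |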
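
Part 1.4 also recommends saving the scraped data. A minimal sketch of one way to do that with pandas is below; the helper name `save_pals` is hypothetical, and the two sample rows are copied from the output above. Writing with `index=False` leaves the row index out of the file, which avoids an extra unnamed index column when the CSV is read back. An in-memory buffer stands in for a file here; passing a path string such as `'pals.csv'` (a hypothetical filename) should work the same way.

```python
import io

import pandas as pd


def save_pals(df, target):
    """
    Writes a pal DataFrame to CSV without the row index

    Args:
        df (DataFrame): scraped pal data
        target (str or file-like): path or buffer to write to
    """
    df.to_csv(target, index=False)


# two rows copied from the scraper output, used as sample data
sample = pd.DataFrame({
    'Name': ['anubis', 'arsox'],
    'HP': [120, 85],
    'Defense': [100, 95],
    'Melee Attack': [130, 100],
    'Shot Attack': [130, 95],
})

# write to an in-memory buffer, then read it back to check the round trip
buffer = io.StringIO()
save_pals(sample, buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded)
```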
