# AstaGuru Job Application Assignment

## Assignment Questions

### Q1: How will you approach web scraping for auction data, considering potential ethical and legal considerations around copyright and data privacy? What strategies would you use to ensure responsible data collection? Demonstrate through a working example.

#### Ethical and Legal Considerations
*Compliance with Terms of Service* : I would review the Terms of Service of the target website and ensure that my scraping activities do not violate them. Respecting these rules is essential to avoid legal issues.

*Copyright and Data Privacy* : I would ensure that the data collected is used in a manner that respects copyright laws and data privacy regulations (e.g., GDPR in Europe, the Digital Personal Data Protection Act, 2023 in India). I would avoid collecting personal information unless absolutely necessary, and where it is needed, I would anonymize it.
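Where a personal identifier has to be kept in order to link records, one option is to replace it with a keyed hash. The following is a minimal sketch of that idea using only the standard library; the record fields and the secret key are made up for illustration. Strictly speaking this is pseudonymization rather than full anonymization, since anyone holding the key can recompute the tokens.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be loaded from a secrets store,
# never hard-coded or stored alongside the data.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable token for a personal identifier using HMAC-SHA256."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


# Hypothetical scraped record containing a personal field
record = {"buyer_email": "Jane.Doe@example.com", "lot": "Lot 12", "price_usd": 125000}

# Keep only what the analysis needs and drop the raw identifier
safe_record = {
    "buyer_id": pseudonymize(record["buyer_email"]),
    "lot": record["lot"],
    "price_usd": record["price_usd"],
}
print(safe_record)
```

The same buyer always maps to the same token, so purchases can still be grouped per buyer without storing the email address itself.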

*Rate Limiting* : I would implement rate limiting practices to avoid overwhelming the server, ensuring the website remains accessible to other users. This would be done by adding delays between requests to avoid making too many requests in a short period.

*Respectful Access* : I would use ethical scraping practices, such as setting a user agent string that identifies the scraper and including contact information in case the website administrators need to reach out. 
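The rate limiting and identification practices above could be sketched as follows. This is an illustration under assumptions, not a finished scraper: the `RateLimiter` class, the bot name, and the contact address are all hypothetical, and the loop only simulates requests so that the example runs offline.

```python
import time


class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval_seconds: float):
        self.min_interval = min_interval_seconds
        self._last_call = None

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the seconds slept."""
        slept = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (time.monotonic() - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last_call = time.monotonic()
        return slept


# Hypothetical identification details; replace with a real project name and contact
HEADERS = {"User-Agent": "auction-research-bot/0.1 (contact: data-team@example.com)"}

limiter = RateLimiter(min_interval_seconds=0.2)  # short interval just for the demo
start = time.monotonic()
for page in range(3):
    limiter.wait()
    # A real scraper would call requests.get(page_url, headers=HEADERS, timeout=10) here
elapsed = time.monotonic() - start
print(f"3 simulated requests took {elapsed:.2f}s")
```

In real use the interval would be much longer (often a second or more), and it should never be shorter than any crawl delay the site itself requests.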

#### Strategies for Responsible Data Collection
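*Robots.txt* : I would always check the robots.txt file of the website to understand which parts of the site the operator allows automated clients to access, and whether crawling is permitted at all. Robots.txt is a convention rather than a legal document, so I would treat it as a minimum requirement alongside the Terms of Service.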
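Python's standard library can perform this check. The sketch below parses a made-up robots.txt from a string so that it runs offline; the rules, the URLs, and the bot name are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules, parsed from a string so the sketch runs offline
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

USER_AGENT = "auction-research-bot"  # hypothetical scraper name

# Ask whether this user agent may fetch each URL
allowed_public = parser.can_fetch(USER_AGENT, "https://example.com/auctions/results")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.com/private/buyers")

# Crawl delay requested by the site, or None if it does not specify one
delay = parser.crawl_delay(USER_AGENT)

print(allowed_public, allowed_private, delay)
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` would download the real file in place of the `parse` call.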

*API Access* : If the website provides an API, I would prefer using it over scraping HTML pages. APIs are designed for data access and are often more stable and reliable than web scraping.

*Data Storage* : I would store data securely and ensure it is accessible only to authorized personnel. I would also encrypt sensitive data both in transit and at rest.

#### Example

I want to retrieve data about the most expensive paintings ever sold from Wikipedia. The table on that page includes the name and image of each painting, the artist, the year of creation, the date of sale, and more. Here is the link to the page I want to scrape: https://en.wikipedia.org/wiki/List_of_most_expensive_paintings

In [26]:
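import re

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_most_expensive_paintings'

# Identify the scraper and avoid hanging on a slow server
# (the User-Agent name and contact address are placeholders)
headers = {'User-Agent': 'auction-research-scraper/0.1 (contact: you@example.com)'}
response = requests.get(url, headers=headers, timeout=10)

# Checking to see if the request was successful
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')

    # Finding the required table (the first table with class 'wikitable')
    table = soup.find('table', class_='wikitable')

    paintings = []
    artists = []
    sale_prices = []
    sale_dates = []

    # Extracting the data, skipping the header row
    rows = table.find_all('tr')[1:]
    for row in rows:
        columns = row.find_all('td')
        # Skip rows that do not contain every expected cell
        if len(columns) < 6:
            continue
        paintings.append(columns[1].text.strip())
        artists.append(columns[3].text.strip())
        # Remove footnote markers such as [25] or [note 3], then '~' and '$'
        sale_price = re.sub(r'\[[^\]]*\]', '', columns[0].text)
        sale_price = sale_price.replace('~', '').replace('$', '').strip()
        sale_prices.append(sale_price)
        sale_dates.append(columns[5].text.strip())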

    # Creating a DataFrame with the table
    data = pd.DataFrame({
        'Painting': paintings,
        'Artist': artists,
        'Sale Price': sale_prices,
        'Sale Date': sale_dates
    })

    # Saving Data to CSV File
    data.to_csv('most_expensive_paintings_cleaned.csv', index=False)

    print("Data scraped and saved to most_expensive_paintings_cleaned.csv")
else:
    print(f"Failed to retrieve data from {url}")

Data scraped and saved to most_expensive_paintings_cleaned.csv


In [24]:
data.head()

| | Painting | Artist | Sale Price | Sale Date |
|---|---|---|---|---|
| 0 | Salvator Mundi | Leonardo da Vinci | 450.3 | November 15, 2017 |
| 1 | Interchange | Willem de Kooning | 300 | September 2015 |
| 2 | The Card Players | Paul Cézanne | 250 + | April 2011 |
| 3 | Nafea Faa Ipoipo(When Will You Marry?) | Paul Gauguin | 210 | September 2014 |
| 4 | Number 17A | Jackson Pollock | 200 | September 2015 |

Sale prices are in millions of US dollars.


### Q2: How would you identify trends in buyer behavior, item valuation, and market demand for an auction house?

To identify trends, we need to analyze historical data from past auctions. This includes information on items sold, their final prices, buyer demographics, and more. Here are the steps and methodologies to achieve this:

#### 1. Data Collection
*Historical Auction Data* : I would collect data on past auctions, including item descriptions, final bid prices, auction dates, and buyer details.

*External Data* : I would gather additional market data, such as economic indicators, art market reports, and competitor analysis.

#### 2. Data Processing and Cleaning
*Data Cleaning* : This would involve standard data cleaning processes, such as identifying and handling null values (removing or imputing them), removing duplicates, and converting dates and prices to consistent formats.

*Data Enrichment* : Additional data points could be included in the database, such as inflation-adjusted prices or currency conversion rates for international buyers.
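The cleaning and enrichment steps could look something like the sketch below. All records, column names, exchange rates, and price index values are made up for illustration; real figures would come from the auction database and an official data source.

```python
import pandas as pd

# Made-up auction records with typical problems: a duplicate row,
# a missing price, and prices stored as text
raw = pd.DataFrame({
    "lot_id": [101, 102, 102, 103, 104],
    "sale_date": ["2021-03-15", "2021-06-20", "2021-06-20", "2022-01-10", "2022-09-05"],
    "currency": ["USD", "INR", "INR", "USD", "INR"],
    "hammer_price": ["120,000", "5,000,000", "5,000,000", None, "8,300,000"],
})

# Cleaning: drop exact duplicates and rows without a price, then fix the types
clean = raw.drop_duplicates().dropna(subset=["hammer_price"]).copy()
clean["sale_date"] = pd.to_datetime(clean["sale_date"])
clean["hammer_price"] = (
    clean["hammer_price"].str.replace(",", "", regex=False).astype(float)
)

# Enrichment 1: convert to a common currency (illustrative rates, not market data)
usd_per_unit = {"USD": 1.0, "INR": 0.012}
clean["price_usd"] = clean["hammer_price"] * clean["currency"].map(usd_per_unit)

# Enrichment 2: express prices in 2022 terms (illustrative price index)
price_index = {2021: 100.0, 2022: 106.0}
base = price_index[2022]
clean["price_usd_real"] = clean["price_usd"] * (
    base / clean["sale_date"].dt.year.map(price_index)
)

print(clean)
```

Dropping rows with missing prices is only one option; depending on how much data is missing, imputing or flagging those rows may be preferable.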

#### 3. Data Analysis Techniques
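*Descriptive Statistics* : I would calculate the mean, median, and mode of auction prices to understand central tendencies. These are the most basic measures in data analysis, but they remain important: for example, a mean far above the median signals that a few very expensive lots are skewing the average.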
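As a small illustration with made-up prices, the standard library's `statistics` module is enough to compute these measures. The single very expensive lot is included on purpose to show how it affects each one.

```python
import statistics

# Made-up hammer prices in thousands of USD, including one very expensive lot
prices = [40, 55, 55, 60, 75, 90, 1200]

mean_price = statistics.mean(prices)      # pulled upward by the 1200 outlier
median_price = statistics.median(prices)  # middle value, robust to the outlier
mode_price = statistics.mode(prices)      # most frequent value

print(mean_price, median_price, mode_price)
```

Here the mean is 225 while the median is 60, so the median describes a typical lot far better than the mean does.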

*Time Series Analysis* : Time series analysis can be leveraged to identify seasonal trends and cyclical patterns in auction prices.
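One simple way to separate trend from seasonality is a rolling mean, sketched below on synthetic data. The series is generated with a steady upward trend and a bump every March and September; those months are an assumption made purely for this illustration.

```python
import pandas as pd

# Synthetic monthly average hammer prices: upward trend plus a seasonal bump
# in March and September (assumed sale seasons for this illustration)
months = pd.date_range("2021-01-01", periods=36, freq="MS")
values = [100 + 2 * i + (30 if m.month in (3, 9) else 0) for i, m in enumerate(months)]
series = pd.Series(values, index=months, name="avg_price")

# A trailing 12-month rolling mean averages out seasonality and exposes the trend
trend = series.rolling(window=12).mean()

# The average deviation from the trend per calendar month exposes seasonality
seasonal = (series - trend).groupby(series.index.month).mean()
peak_months = sorted(seasonal.nlargest(2).index.tolist())

print(peak_months)
```

With real auction data, a dedicated seasonal decomposition method would likely be a better fit, but the rolling mean shows the basic idea.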

*Regression Analysis* : I would employ regression models to determine factors influencing item valuation, such as artist popularity, condition of the item, provenance, etc. I would also use correlation analysis to look for relationships between pairs of factors, for example whether investing an amount 'x' in one area is associated with a 'greater than x' increase in output/sales. Correlation alone does not establish causation, so any such relationship would need further validation before acting on it.
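A minimal sketch of such a regression, using ordinary least squares in NumPy, is shown below. The feature scores and the pricing rule are invented so that the fitted coefficients can be checked against known values; real data would be noisy and would need a proper statistical library and diagnostics.

```python
import numpy as np

# Made-up lots: [artist_popularity (0-10), condition (0-10)]
X = np.array([[2, 5], [4, 6], [6, 5], [8, 9], [9, 7], [3, 8]], dtype=float)

# Prices generated from a known rule so the fit can be verified:
# price = 10 + 5 * popularity + 2 * condition
y = 10 + 5 * X[:, 0] + 2 * X[:, 1]

# Add an intercept column and solve the least-squares problem
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, b_popularity, b_condition = coef

# Pearson correlation between popularity and price
corr = np.corrcoef(X[:, 0], y)[0, 1]

print(intercept, b_popularity, b_condition, corr)
```

Each coefficient estimates how much the price changes per unit of that factor while the other factor is held fixed, which is more informative than a pairwise correlation.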

*Cluster Analysis* : I would segment buyers into clusters based on purchase behavior and demographics to identify distinct buyer personas. The segmentation could be based on the amount spent, the type of art chosen, the number of bids placed, and so on.
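To illustrate the idea, here is a bare-bones k-means written in NumPy and applied to made-up buyer data. It is a teaching sketch, not a production implementation: it seeds the centroids with the first `k` points, whereas a library such as scikit-learn uses smarter seeding. The `kmeans` helper and the two features are assumptions for this example.

```python
import numpy as np


def kmeans(points: np.ndarray, k: int, n_iter: int = 10):
    """Minimal k-means seeded with the first k points; returns (labels, centroids)."""
    centroids = points[:k].copy()
    for _ in range(n_iter):
        # Assign each point to its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if the cluster is empty)
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids


# Made-up buyers: [total spend in thousands of USD, number of bids placed]
buyers = np.array([[5, 40], [500, 4], [8, 35], [450, 6], [6, 50], [520, 3]], dtype=float)

# Standardize so that spend does not dominate the distance calculation
z = (buyers - buyers.mean(axis=0)) / buyers.std(axis=0)

labels, centroids = kmeans(z, k=2)
print(labels)
```

In this toy data one cluster contains frequent bidders with low spend and the other contains occasional bidders with high spend, which is the kind of split that could inform separate marketing approaches.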

#### 4. Visualization
*Trend Graphs* : I would plot historical auction prices to visualize price trends over time.

*Heatmaps* : Heatmaps could be used to show demand for different categories of items, or to show the geographic distribution of buyers within a country or across the world.
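A heatmap is drawn from a matrix, so the first step is to pivot the data into one. The sketch below builds a category-by-quarter demand matrix from made-up bid counts; the column names and values are assumptions for illustration.

```python
import pandas as pd

# Made-up bid counts per lot
sales = pd.DataFrame({
    "category": ["Paintings", "Jewellery", "Paintings", "Watches", "Jewellery", "Paintings"],
    "quarter": ["Q1", "Q1", "Q2", "Q2", "Q2", "Q2"],
    "bids": [12, 7, 20, 5, 9, 15],
})

# Rows are categories, columns are quarters, cells are total bids (0 if none)
demand = sales.pivot_table(
    index="category", columns="quarter", values="bids", aggfunc="sum", fill_value=0
)

print(demand)
```

The resulting matrix can then be passed to a plotting function such as `seaborn.heatmap` or matplotlib's `imshow` to render the heatmap itself.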

*Buyer Segmentation* : Visualizations could be created to depict buyer clusters and their characteristics. This would allow us to tailor our marketing to the characteristics of each buyer cluster.

There are multiple ways to clean, analyze, and visualize historical data to understand trends, behaviors, market prices and demand within the industry.

### THANK YOU SO MUCH
#### Aayush Damani