## 1. Web Scraping

Modify the scripts we used in class to write a program that downloads both tables on the Wikipedia page for Anscombe's quartet (https://en.wikipedia.org/wiki/Anscombe%27s_quartet). Save each table in its own CSV file. **Note:** The file for the first table should contain the column names; the file for the second table does not need them.

In [48]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

def save_table_to_csv(table, filename, header=True, index=False):
    # Write the DataFrame to disk; header controls whether column names are included
    table.to_csv(filename, header=header, index=index)

page = requests.get('https://en.wikipedia.org/wiki/Anscombe%27s_quartet')
page.raise_for_status()  # stop early if the download failed
data = page.text

soup = BeautifulSoup(data, 'html.parser')

# Create a list to store the tables
tables = []

for table in soup.find_all("table"):
    # Skip tables without a caption
    if table.find('caption'):
        fullTable = []
        columnNames = []  # reset for every table so names never carry over
        for tr in table.find_all('tr'):
            if tr.find_all('th'):
                # Header row: remember the column names
                columnNames = [th.get_text().strip() for th in tr.find_all('th')]
            else:
                # Data row: collect the text of each cell
                fullTable.append([td.get_text().strip() for td in tr.find_all('td')])

        # Use the header only when it matches the width of the data rows
        if fullTable and columnNames and len(columnNames) == len(fullTable[0]):
            newTable = pd.DataFrame(fullTable, columns=columnNames)
        else:
            newTable = pd.DataFrame(fullTable)

        tables.append(newTable)

# Save the first table with column names
if len(tables) > 0:
    save_table_to_csv(tables[0], 'anscombe_table1.csv', header=True, index=False)

# Save the second table without column names if it exists
if len(tables) > 1:
    save_table_to_csv(tables[1], 'anscombe_table2.csv', header=False, index=False)
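
The header requirement comes down to the `header` argument of `DataFrame.to_csv`. The following is a minimal sketch of that behaviour, assuming a made-up two-row frame (`demo`, not the real Anscombe data) and in-memory buffers in place of files:

```python
import io
import pandas as pd

# Hypothetical stand-in for a scraped table; the values are strings, as scraped text would be
demo = pd.DataFrame([["10.0", "8.04"], ["8.0", "6.95"]], columns=["x", "y"])

# Same call twice, differing only in the header flag
with_header = io.StringIO()
demo.to_csv(with_header, header=True, index=False)

without_header = io.StringIO()
demo.to_csv(without_header, header=False, index=False)

print(with_header.getvalue())     # first line holds the column names
print(without_header.getvalue())  # data rows only
```

Writing to a buffer first is a cheap way to check the layout before creating the real CSV files.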



## 2. Pandas and Stats

The Iris dataset is one of the most famous datasets in statistics. Read about it on Wikipedia: https://en.wikipedia.org/wiki/Iris_flower_data_set.

Download the dataset from the table on the Wikipedia page using BeautifulSoup or pandas, and create a pandas DataFrame containing the dataset (including column names). **Note:** The first column of the table contains only the order of the points in the dataset; it should become the index of your DataFrame.

In [49]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from io import StringIO

# URL of the Wikipedia page on the Iris flower dataset
wiki_url = 'https://en.wikipedia.org/wiki/Iris_flower_data_set'

# Send a GET request to the URL
response = requests.get(wiki_url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the table containing the Iris dataset
    tables = soup.find_all('table', {'class': 'wikitable'})
    
    # Assuming the Iris dataset table is the first table on the page
    iris_table = tables[0]

    # Read the table using pandas, considering the first column as the index.
    # The HTML is wrapped in StringIO because passing a literal string to
    # read_html is deprecated.
    iris_df = pd.read_html(StringIO(str(iris_table)), index_col=0)[0]

    print(iris_df.head())  # Display the first few rows of the DataFrame

else:
    print(f'Error: Unable to fetch the page (Status code: {response.status_code})')


               Sepal length  Sepal width  Petal length  Petal width    Species
Dataset order                                                                 
1                       5.1          3.5           1.4          0.2  I. setosa
2                       4.9          3.0           1.4          0.2  I. setosa
3                       4.7          3.2           1.3          0.2  I. setosa
4                       4.6          3.1           1.5          0.2  I. setosa
5                       5.0          3.6           1.4          0.3  I. setosa



Your DataFrame might have string values in its columns. If so, convert each column that should contain numbers to numeric values (see the function `pd.to_numeric`).
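
As a quick illustration of what `pd.to_numeric` does with text, here is a minimal sketch on a made-up Series (`raw` is hypothetical, and `"n/a"` stands in for any cell that is not a number). With `errors="coerce"`, values that cannot be parsed become `NaN` instead of raising an error:

```python
import pandas as pd

# Hypothetical column of scraped text; "n/a" represents a non-numeric cell
raw = pd.Series(["5.1", "4.9", "n/a"])

# Parse the strings; unparseable entries are replaced by NaN
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # numeric dtype after conversion
print(converted.isna().sum())  # number of cells that could not be parsed
```

Counting the `NaN` values afterwards is a simple way to notice whether the coercion silently dropped any data.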

After converting the columns to numeric, use the `describe()` method to calculate the average and standard deviation of each variable.

In [50]:
# Convert columns to numeric
numeric_columns = ['Sepal length', 'Sepal width', 'Petal length', 'Petal width']
iris_df[numeric_columns] = iris_df[numeric_columns].apply(pd.to_numeric, errors='coerce')

# Calculate average and standard deviation for each variable using describe()
description = iris_df.describe()

print(description)


       Sepal length  Sepal width  Petal length  Petal width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.200000
std        0.828066     0.435866      1.765298     0.761401
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
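
When reading the `std` row, it helps to know that pandas reports the sample standard deviation (dividing by n - 1). The sketch below checks this on a made-up list of values by comparing against `statistics.stdev` from the standard library, which uses the same definition:

```python
import statistics
import pandas as pd

values = [1.0, 2.0, 3.0, 4.0]  # made-up numbers, not Iris measurements

summary = pd.Series(values).describe()

print(summary["mean"])           # arithmetic mean
print(summary["std"])            # sample standard deviation from pandas
print(statistics.stdev(values))  # same quantity from the standard library
```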


Use the `.groupby()` method to group the data by species and calculate the average and standard deviation for each variable based on the iris species.

In [51]:
# Group by species and calculate average and standard deviation
group_by_species = iris_df.groupby('Species')
summary_statistics = group_by_species.agg({
    'Sepal length': ['mean', 'std'],
    'Sepal width': ['mean', 'std'],
    'Petal length': ['mean', 'std'],
    'Petal width': ['mean', 'std'],
})

print(summary_statistics)

              Sepal length           Sepal width           Petal length  \
                      mean       std        mean       std         mean   
Species                                                                   
I. setosa            5.006  0.352490       3.428  0.379064        1.462   
I. versicolor        5.936  0.516171       2.770  0.313798        4.260   
I. virginica         6.588  0.635880       2.974  0.322497        5.552   

                        Petal width            
                    std        mean       std  
Species                                        
I. setosa      0.173664       0.248  0.105444  
I. versicolor  0.469911       1.326  0.197753  
I. virginica   0.551895       2.026  0.274650  
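
When every column gets the same statistics, the per-column dictionary can be shortened by selecting the columns first and passing a list of function names to `agg`. This is a sketch of that alternative on a made-up frame (`toy`, `measure_cols`, and the species labels `a` and `b` are hypothetical), not a replacement for the cell above:

```python
import pandas as pd

# Hypothetical miniature dataset with two groups
toy = pd.DataFrame({
    "Species": ["a", "a", "b", "b"],
    "Petal length": [1.0, 3.0, 4.0, 8.0],
    "Petal width": [0.2, 0.4, 1.0, 2.0],
})

measure_cols = ["Petal length", "Petal width"]

# Apply the same two statistics to every selected column, per group
toy_summary = toy.groupby("Species")[measure_cols].agg(["mean", "std"])

print(toy_summary)
```

The result has the same two-level column layout as the output above: the variable name on top and the statistic underneath.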


Make a scatter plot matrix showing the covariance of the variables. See plotly's `create_scatterplotmatrix` function in the `figure_factory` module. Your graph should look like this:

<img src="iris.png"></img>

In [52]:
import plotly.figure_factory as ff

# Use Plotly's create_scatterplotmatrix to create the scatter plot matrix
scatter_plot_matrix = ff.create_scatterplotmatrix(iris_df, diag='histogram', index='Species', height=800, width=800)

# Display the plot
scatter_plot_matrix.show()
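
The scatter plot matrix shows the relationships between variables visually. To put numbers next to it, pandas offers `DataFrame.cov()` and `DataFrame.corr()`. The following is a minimal sketch on a made-up frame (`toy` is hypothetical) in which `y` is exactly twice `x`:

```python
import pandas as pd

# Hypothetical data with a perfect linear relationship: y = 2 * x
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 6.0]})

cov_matrix = toy.cov()    # pairwise sample covariances
corr_matrix = toy.corr()  # pairwise Pearson correlations

print(cov_matrix)
print(corr_matrix)
```

Assuming the numeric columns were converted as in the earlier cell, the same call on the real data would be `iris_df[numeric_columns].cov()`.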