# Data Science - Class 1

### Exercise 1 - Data Collection
Use the BeautifulSoup library (Python) to perform **web scraping** on the site https://www.ayush.nz/technology

* requests.get(url, headers=headers) – Fetches the HTML content of a webpage.
    * Example: response = requests.get("https://example.com")
* BeautifulSoup(html, 'html.parser') – Parses HTML content.
    * Example: soup = BeautifulSoup(response.text, 'html.parser')
* soup.select('div.article-link') – Selects elements using CSS selectors.
    * Example: articles = soup.select('div.article-link')
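
The parsing and selection calls above can be tried offline, without sending a request. The following is a minimal sketch, assuming the bs4 package is installed; SAMPLE_HTML is a shortened copy of one article block from the target page, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened copy of one article block (assumption: the live page uses the same structure).
SAMPLE_HTML = """
<div class="article-link">
    <p>
        <a href="/2022/11/consuming-apis-responsibly" title="Consuming APIs responsibly">Consuming APIs responsibly</a><span class="muted"> / Nov 2022</span>
    </p>
    <div class="excerpt">
        Or: Etiquette and table manners for pinging other people's servers.
    </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() returns a list of every element matching the CSS selector.
blocks = soup.select("div.article-link")

# select_one() returns only the first match (or None when nothing matches).
first = blocks[0]
link = first.select_one("a")
title = link.get_text(strip=True)
href = link["href"]

# The span text is " / Nov 2022"; drop the leading slash and surrounding spaces.
date = first.select_one("span.muted").get_text(strip=True).lstrip("/").strip()

print(len(blocks), title, href, date)
```

Working against a fixed string like this makes it easier to check selectors before pointing the scraper at the live site.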

In [1]:
import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'https://www.ayush.nz/technology'  # URL of the page to scrape
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# HTML element
"""
<div class="article-link">
    <p>
        <a href="/2022/11/consuming-apis-responsibly" title="Consuming APIs responsibly">Consuming APIs responsibly</a><span class="muted"> / Nov 2022</span>
    </p>
    <div class="excerpt">
        Or: Etiquette and table manners for pinging other people's servers.
        <img src="https://www.ayush.nz/static/images/img-normal/2022-11-21-consuming-apis-responsibly.png" alt="Banner image for Consuming APIs responsibly">
    </div>
</div>
"""

response = requests.get(url, headers=headers)

if response.status_code != 200:
    print(f"Failed to retrieve page. Status code: {response.status_code}")
    exit()

# Content parsing
html = BeautifulSoup(response.text, 'html.parser')
articles = html.select('div.article-link')  # Select article containers

my_data = []

# Extract the title, link, date, and excerpt from each article block
for article in articles:
    try:
        # Extract title and link from the <a> tag
        title_tag = article.find('a')
        title = title_tag.get_text(strip=True)
        link = title_tag['href']  # separate name so the page URL stored in `url` is not overwritten

        # Extract date from the muted span (its text looks like " / Nov 2022")
        date = article.find('span', class_='muted').get_text(strip=True).replace('/', '').strip()

        # Extract excerpt (some articles have images in excerpts)
        excerpt_div = article.find('div', class_='excerpt')
        if excerpt_div:
            # Remove any images from excerpt
            for img in excerpt_div.find_all('img'):
                img.decompose()
            excerpt = excerpt_div.get_text(strip=True)
        else:
            excerpt = "No excerpt available"

        # Append the data to the list
        my_data.append({"title": title, "url": link, "date": date, "excerpt": excerpt})
        
    except Exception as e:
        print(f"Error processing article: {e}")
        continue

pprint(my_data)

[{'date': 'Nov 2022',
  'excerpt': "Or: Etiquette and table manners for pinging other people's "
             'servers.',
  'title': 'Consuming APIs responsibly',
  'url': '/2022/11/consuming-apis-responsibly'},
 {'date': 'Feb 2022',
  'excerpt': 'Add Jekyll posts into a series with series navigation.',
  'title': 'Create a series of posts with navigation in Jekyll',
  'url': '/2022/02/creating-article-series-posts-navigation-jekyll'},
 {'date': 'Jan 2022',
  'excerpt': 'Implementing light and dark mode on your Bootstrap 5 + Jekyll '
             'website.',
  'title': 'A practical guide to light and dark mode in Bootstrap 5 and Jekyll',
  'url': '/2022/01/practical-light-dark-mode-jekyll-bootstrap5'},
 {'date': 'Jan 2022',
  'excerpt': 'Celebrate 2022 with a shiny new file manager for Ubuntu!',
  'title': 'Nemo - The Ubuntu file manager you didn’t know you needed',
  'url': '/2022/01/nemo-file-manager-ubuntu-20.04-linux-nautilus-alternative'},
 {'date': 'Jan 2022',
  'excerpt': "I'v
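
The url values in the output above are site-relative paths. A minimal sketch of turning them into absolute links with the standard library, assuming the scraped page URL is the right base to resolve against; the helper name absolutize and the one-record sample are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.ayush.nz/technology"  # page the records were scraped from


def absolutize(records, base_url=BASE_URL):
    """Return copies of the records with the 'url' field resolved against base_url."""
    return [{**rec, "url": urljoin(base_url, rec["url"])} for rec in records]


# One record shaped like the scraper output (other fields omitted for brevity).
sample = [{"title": "Consuming APIs responsibly",
           "url": "/2022/11/consuming-apis-responsibly"}]

resolved = absolutize(sample)
print(resolved[0]["url"])
# https://www.ayush.nz/2022/11/consuming-apis-responsibly
```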

### Exercise 2 - API Usage
Request data from the public API endpoint https://jsonplaceholder.typicode.com/posts using the requests and NumPy libraries in Python (httr and jsonlite in R), and save an array containing every userId.

* requests.get(api_url) – Sends a GET request to an API.
    * Example: response = requests.get("https://jsonplaceholder.typicode.com/posts")
* response.json() – Parses the API response as JSON.
    * Example: data = response.json()
* np.array(data) – Converts a list into a NumPy array.
    * Example: array_data = np.array([1, 2, 3])
* np.savetxt(file_path, array, delimiter=',', fmt='%d') – Saves an array to a CSV file.
    * Example: np.savetxt("output.csv", array_data, delimiter=",")
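
The JSON-to-array-to-CSV steps above can be rehearsed without network access. This is a minimal sketch, assuming NumPy is installed; the payload string is a made-up stand-in for the API response, and the file is written to a temporary directory so nothing is left behind.

```python
import json
import os
import tempfile

import numpy as np

# Hypothetical payload shaped like the /posts response (values are made up).
payload = '[{"userId": 1, "id": 1}, {"userId": 1, "id": 2}, {"userId": 2, "id": 11}]'

# json.loads gives the same kind of object as response.json(): a list of dicts.
posts = json.loads(payload)

# Pull one field out of every record and convert the list to an array.
user_ids = np.array([post["userId"] for post in posts])

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "user_ids.csv")
    # fmt="%d" writes plain integers; the default format would write floats.
    np.savetxt(csv_path, user_ids, delimiter=",", fmt="%d")
    # Read the file back to confirm the round trip.
    reloaded = np.loadtxt(csv_path, delimiter=",", dtype=int)

# Count how many posts each user has.
unique_ids, counts = np.unique(user_ids, return_counts=True)
posts_per_user = dict(zip(unique_ids.tolist(), counts.tolist()))

print(reloaded)
print(posts_per_user)
```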

In [2]:
import requests
import numpy as np

# Define the API endpoint
api_url = "https://jsonplaceholder.typicode.com/posts"

# Example of the data:
#  {
#    "userId": 1,
#    "id": 1,
#    "title": "sunt aut facere repellat provident occaecati excepturi optio reprehenderit",
#    "body": "quia et suscipit\nsuscipit recusandae consequuntur expedita et cum\nreprehenderit molestiae ut ut quas totam\nnostrum rerum est autem sunt rem eveniet architecto"
#  },
#  {
#    "userId": 1,
#    "id": 2,
#    "title": "qui est esse",
#    "body": "est rerum tempore vitae\nsequi sint nihil reprehenderit dolor beatae ea dolores neque\nfugiat blanditiis voluptate porro vel nihil molestiae ut reiciendis\nqui aperiam non debitis possimus qui neque nisi nulla"
#  },
#  {
#    "userId": 1,
#    "id": 3,
#    "title": "ea molestias quasi exercitationem repellat qui ipsa sit aut",
#    "body": "et iusto sed quo iure\nvoluptatem occaecati omnis eligendi aut ad\nvoluptatem doloribus vel accusantium quis pariatur\nmolestiae porro eius odio et labore et velit aut"
#  }, ...

try:
    # Send a GET request to the API
    response = requests.get(api_url)

    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the JSON response
        data = response.json()

        # Extract the userIds
        user_ids = [post['userId'] for post in data]

        # Convert the list to a NumPy array
        users_ids_array = np.array(user_ids)

        # Specify the file path where you want to save the CSV file
        file_path = 'api_saved_ids.csv'

        # Save the NumPy array as a CSV file (np.savetxt returns None, so nothing is assigned)
        np.savetxt(file_path, users_ids_array, delimiter=',', fmt='%d')

        # Print the array
        print("Array of ids:", users_ids_array)

        # Calculate the mean of user IDs
        mean_user_id = np.mean(users_ids_array)

    else:
        print("Error: Unable to fetch data from the API. Status code:", response.status_code)

except Exception as e:
    print("An error occurred:", str(e))

Array of ids: [ 1  1  1  1  1  1  1  1  1  1  2  2  2  2  2  2  2  2  2  2  3  3  3  3
  3  3  3  3  3  3  4  4  4  4  4  4  4  4  4  4  5  5  5  5  5  5  5  5
  5  5  6  6  6  6  6  6  6  6  6  6  7  7  7  7  7  7  7  7  7  7  8  8
  8  8  8  8  8  8  8  8  9  9  9  9  9  9  9  9  9  9 10 10 10 10 10 10
 10 10 10 10]


### Exercise 3 - Data Load
Given the dataset diabetes.csv, load the data using pandas (Python) and the read function in R, sort it by the 'BloodPressure' column, and then save the result as a CSV file.

* pd.read_csv(file_path) – Reads a CSV file into a Pandas DataFrame.
    * Example: df = pd.read_csv("data.csv")
* df.sort_values(by='column_name') – Sorts DataFrame by a column.
    * Example: sorted_df = df.sort_values(by='Age')
* df.to_csv('output.csv', index=False) – Saves DataFrame to a CSV file.
    * Example: df.to_csv("sorted.csv", index=False)
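
One detail of these calls is easy to miss: sort_values returns a new DataFrame and leaves the original untouched, so the sorted result is the object that has to be saved. A minimal in-memory sketch, assuming pandas is installed; the three rows are made up, and only the BloodPressure column name comes from the exercise (Glucose is an assumed second column).

```python
import io

import pandas as pd

# Made-up rows; only the BloodPressure column name is taken from the exercise.
csv_text = "Glucose,BloodPressure\n148,72\n85,66\n183,64\n"

# read_csv accepts any file-like object, so a StringIO stands in for a file on disk.
df = pd.read_csv(io.StringIO(csv_text))

# sort_values returns a sorted copy; df itself keeps its original row order.
sorted_df = df.sort_values(by="BloodPressure")

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_out = sorted_df.to_csv(index=False)

print(df["BloodPressure"].tolist())         # original order
print(sorted_df["BloodPressure"].tolist())  # ascending order
print(csv_out)
```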

In [4]:
import pandas as pd

def load_and_manipulate_data(file_path):
    # Load, sort, and save the dataset
    try:
        # Load the CSV at the path supplied by the caller
        df = pd.read_csv(file_path)

        # Sort by column: BloodPressure (sort_values returns a new DataFrame)
        sorted_df = df.sort_values(by='BloodPressure')

        # Save the sorted data to a new CSV file
        sorted_df.to_csv("sorted.csv", index=False)

    except FileNotFoundError:
        print("Error: File not found.")
    except Exception as e:
        print(f"An error occurred: {e}")

# Ask the user for the path to the file, then load and sort the data
load_and_manipulate_data(input("Enter the path to the CSV file: "))  # If the file is in the same folder as the script, just type: diabetes.csv

Error: File not found.


### Exercise 4 - SQL Database Interaction
Create the following tables, insert the data shown, and print the tables using the SQL libraries SQLite (Python) and RSQLite (R).

* sqlite3.connect(db_file) – Establishes a SQLite database connection.
    * Example: conn = sqlite3.connect("database.db")
* cursor.execute("CREATE TABLE ...") – Executes SQL commands.
    * Example: cursor.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
* cursor.execute("INSERT INTO table VALUES (...)") – Inserts data.
    * Example: cursor.execute("INSERT INTO users (name) VALUES ('Alice')")
* cursor.fetchall() – Fetches all rows from a query result.
    * Example: data = cursor.fetchall()
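
The same calls can be combined with "?" placeholders, which pass values separately from the SQL text, and with a JOIN across two related tables. A minimal sketch using an in-memory database so it can be rerun freely; the Physics row is made up for illustration. Note that SQLite only enforces foreign keys once the foreign_keys pragma is switched on for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")        # in-memory database, discarded on close
conn.execute("PRAGMA foreign_keys = ON")  # foreign keys are not enforced by default
cur = conn.cursor()

cur.execute("""
    CREATE TABLE students (
        student_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        age INTEGER
    )
""")
cur.execute("""
    CREATE TABLE grades (
        grade_id INTEGER PRIMARY KEY,
        subject TEXT NOT NULL,
        grade INTEGER,
        student_id INTEGER,
        FOREIGN KEY (student_id) REFERENCES students (student_id)
    )
""")

# "?" placeholders keep values out of the SQL string.
cur.execute("INSERT INTO students (student_id, name, age) VALUES (?, ?, ?)",
            (1, "John Doe", 20))
cur.executemany("INSERT INTO grades (subject, grade, student_id) VALUES (?, ?, ?)",
                [("Math", 90, 1), ("Physics", 85, 1)])  # Physics row is hypothetical
conn.commit()

# Join each grade to the student it belongs to.
cur.execute("""
    SELECT s.name, g.subject, g.grade
    FROM grades AS g
    JOIN students AS s ON s.student_id = g.student_id
    ORDER BY g.grade_id
""")
rows = cur.fetchall()
print(rows)

# With the pragma on, a grade for a non-existent student is rejected.
try:
    cur.execute("INSERT INTO grades (subject, grade, student_id) VALUES (?, ?, ?)",
                ("Art", 70, 999))
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print("Foreign key violation rejected:", fk_rejected)

conn.close()
```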

In [None]:
import sqlite3

# Function to create tables
def create_tables(conn):
    cursor = conn.cursor()

    # Create the "students" table
    cursor.execute('''
        CREATE TABLE students (
            student_id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            age INTEGER
        )
    ''')

    # Create the "grades" table with a foreign key reference to students
    cursor.execute('''
        CREATE TABLE grades (
            grade_id INTEGER PRIMARY KEY,
            subject TEXT NOT NULL,
            grade INTEGER,
            student_id INTEGER,
            FOREIGN KEY (student_id) REFERENCES students (student_id)
        )
    ''')

    conn.commit()

# Function to insert data into tables
def insert_data(conn):
    cursor = conn.cursor()
    cursor.execute("INSERT INTO students (student_id, name, age) VALUES (1, 'John Doe', 20)")
    cursor.execute("INSERT INTO grades (grade_id, subject, grade, student_id) VALUES (1, 'Math', 90, 1)")

    # Commit the data insertion
    conn.commit()

# Function to print the contents of tables
def print_tables(conn):
    cursor = conn.cursor()

    # Print the "students" table
    cursor.execute("SELECT * FROM students")
    print("\nStudents Table:")
    print(cursor.fetchall())

    # Print the "grades" table
    cursor.execute("SELECT * FROM grades")
    print("\nGrades Table:")
    print(cursor.fetchall())

# Connect to the SQLite database (the file is created if it does not exist)
db_file_path = "school_database.db"
conn = sqlite3.connect(db_file_path)

create_tables(conn)
insert_data(conn)
print_tables(conn)

# Close the database connection
conn.close()


Students Table:
[(1, 'John Doe', 20)]

Grades Table:
[(1, 'Math', 90, 1)]


### Exercise 5 - Data Serialization
Create a random multidimensional array, then serialize and deserialize it using the pickle library.

* pickle.dump(object, file) – Serializes an object.
    * Example: pickle.dump(data, open("data.pkl", "wb"))
* pickle.load(file) – Deserializes an object.
    * Example: data = pickle.load(open("data.pkl", "rb"))
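
pickle also has in-memory counterparts, pickle.dumps and pickle.loads, which work on bytes instead of files. A minimal standard-library sketch of a round trip on a nested list; the 2x3 shape and the seed are arbitrary choices for illustration. Keep in mind that unpickling data from an untrusted source is unsafe, because loading can execute arbitrary code.

```python
import pickle
import random

# Seeded generator so the nested list is reproducible between runs.
rng = random.Random(42)
nested = [[rng.random() for _ in range(3)] for _ in range(2)]  # 2x3 list of floats

# dumps serializes to a bytes object; loads rebuilds an equal, independent object.
payload = pickle.dumps(nested)
restored = pickle.loads(payload)

print(type(payload).__name__)   # bytes
print(restored == nested)       # True: same values
print(restored is nested)       # False: a new object was created
```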

In [6]:
#%% 5- serialize and deserialize a randomly created array
import numpy as np
import pickle

# Define a seed
np.random.seed(42)

# Function to create a random NumPy array
def create_random_array(shape):
    return np.random.random(shape)

# Function to serialize and deserialize the NumPy array using pickle
def serialize_and_deserialize(array):
    # Serialize the array
    with open('serialized_array.pkl', 'wb') as file:
        pickle.dump(array, file)
    print("Serialized successfully")

    # Deserialize the array
    with open('serialized_array.pkl', 'rb') as file:
        loaded_array = pickle.load(file)

    print("Deserialized Array:")
    print(loaded_array)
    return loaded_array


# Create a random NumPy array
random_array = create_random_array((3, 3))

# Print the original array
print("Original Array:")
print(random_array)

# Serialize and deserialize the array
restored_array = serialize_and_deserialize(random_array)

# Optional check
print("\nArrays are equal:", np.array_equal(random_array, restored_array))

Original Array:
[[0.37454012 0.95071431 0.73199394]
 [0.59865848 0.15601864 0.15599452]
 [0.05808361 0.86617615 0.60111501]]
Serialized successfully
Deserialized Array:
[[0.37454012 0.95071431 0.73199394]
 [0.59865848 0.15601864 0.15599452]
 [0.05808361 0.86617615 0.60111501]]

Arrays are equal: True
