
1. Read the text at this URL and find the 10 most frequent words: romeo_and_juliet = 'http://www.gutenberg.org/files/1112/1112.txt'

In [None]:
import requests
from bs4 import BeautifulSoup
from collections import Counter
import re

# Fetch the content from the URL
url = 'http://www.gutenberg.org/files/1112/1112.txt'
response = requests.get(url)
html_content = response.text

# The URL is expected to serve plain text; BeautifulSoup strips any markup
# if the server returns an HTML page instead
soup = BeautifulSoup(html_content, 'html.parser')

# Extract the text content
text = soup.get_text()

# Tokenize the text into words
words = re.findall(r'\b\w+\b', text.lower())

# Filter out common English stop words
# You can customize this list
stop_words = {'the', 'and', 'to', 'of', 'a', 'i', 'you', 'it', 'in', 'is'}

filtered_words = [word for word in words if word not in stop_words]

# Count the occurrences of each word
word_counts = Counter(filtered_words)

# Get the 10 most frequent words
most_common_words = word_counts.most_common(10)

# Print the results
for word, count in most_common_words:
    print(f'{word}: {count}')


gutenberg: 4
project: 3
about: 3
contact: 3
help: 3
404: 2
privacy: 2
policy: 2
terms: 2
use: 2
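
The words counted above (`404`, `privacy`, `policy`, `terms`) suggest the request returned an error page rather than the play. A minimal sketch of one way to guard against that, assuming a hypothetical helper `looks_like_error_page` and a hypothetical `top_words` function that wraps the counting steps from the cell above; the sample text is made up so the block runs without a network call:

```python
import re
from collections import Counter

# Same illustrative stop-word list as in the cell above
STOP_WORDS = {'the', 'and', 'to', 'of', 'a', 'i', 'you', 'it', 'in', 'is'}


def looks_like_error_page(status_code, text):
    """Hypothetical check: non-200 status, or HTML where plain text was expected."""
    return status_code != 200 or '<html' in text[:1000].lower()


def top_words(text, n=10, stop_words=STOP_WORDS):
    """Return the n most frequent words in text, ignoring stop words."""
    words = re.findall(r'\b\w+\b', text.lower())
    counts = Counter(word for word in words if word not in stop_words)
    return counts.most_common(n)


# Made-up sample standing in for response.text
sample = "Romeo, Romeo! Wherefore art thou Romeo? Deny thy father and refuse thy name."

if not looks_like_error_page(200, sample):
    for word, count in top_words(sample, n=2):
        print(f'{word}: {count}')
```

With a real request, `response.status_code` and `response.text` would be passed to `looks_like_error_page` before counting.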


2. Read the cats API (cats_api = 'https://api.thecatapi.com/v1/breeds') and find:
the min, max, mean, median, standard deviation of cats' weight in metric units.
the min, max, mean, median, standard deviation of cats' lifespan in years.
Create a frequency table of country and breed of cats

In [None]:
import requests
import pandas as pd

# Fetch data from the Cat API
cats_api = 'https://api.thecatapi.com/v1/breeds'
response = requests.get(cats_api)
cats_data = response.json()

# Create a DataFrame from the API response
cats_df = pd.DataFrame(cats_data)

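# Take the first number of the metric weight string (the lower bound if the
# value is a range such as '3 - 5'); the upper bound is ignored
cats_df['weight_metric'] = cats_df['weight'].apply(lambda x: x['metric'].split()[0] if x['metric'] else None)

# Take the first number of the life span string (again the lower bound of a range)
cats_df['lifespan_years'] = cats_df['life_span'].apply(lambda x: int(x.split()[0]) if x else None)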

# Convert weight and lifespan columns to numeric
cats_df['weight_metric'] = pd.to_numeric(cats_df['weight_metric'], errors='coerce')
cats_df['lifespan_years'] = pd.to_numeric(cats_df['lifespan_years'], errors='coerce')

# Calculate statistics for weight and lifespan
weight_stats = {
    'min': cats_df['weight_metric'].min(),
    'max': cats_df['weight_metric'].max(),
    'mean': cats_df['weight_metric'].mean(),
    'median': cats_df['weight_metric'].median(),
    'std_dev': cats_df['weight_metric'].std()
}

lifespan_stats = {
    'min': cats_df['lifespan_years'].min(),
    'max': cats_df['lifespan_years'].max(),
    'mean': cats_df['lifespan_years'].mean(),
    'median': cats_df['lifespan_years'].median(),
    'std_dev': cats_df['lifespan_years'].std()
}

# print statistics
print("Statistics for Cats' Weight (metric units):")
print(weight_stats)

print("\nStatistics for Cats' Lifespan (years):")
print(lifespan_stats)

# Create a frequency table of country and breed
frequency_table = cats_df.groupby(['origin', 'name']).size().reset_index(name='count')

print("\nFrequency Table of Country and Breed of Cats:")
print(frequency_table)


Statistics for Cats' Weight (metric units):
{'min': 2, 'max': 5, 'mean': 3.2238805970149254, 'median': 3.0, 'std_dev': 0.8845628182703051}

Statistics for Cats' Lifespan (years):
{'min': 8, 'max': 18, 'mean': 12.074626865671641, 'median': 12.0, 'std_dev': 1.8283411328456127}

Frequency Table of Country and Breed of Cats:
           origin              name  count
0       Australia   Australian Mist      1
1           Burma           Burmese      1
2           Burma  European Burmese      1
3          Canada            Cymric      1
4          Canada            Sphynx      1
..            ...               ...    ...
62  United States          Savannah      1
63  United States       Selkirk Rex      1
64  United States          Snowshoe      1
65  United States            Toyger      1
66  United States    York Chocolate      1

[67 rows x 3 columns]
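
The statistics above use only the first number of each weight and lifespan string, and the frequency table has a count of 1 in every row because each breed name appears once. A minimal sketch of an alternative, assuming the values are ranges written as 'low - high' and using the midpoint of each range, plus a count of breeds per country. The sample records and the helper names `range_midpoint` and `summarize` are hypothetical, and the standard library is used instead of pandas so the block runs on its own:

```python
import statistics
from collections import Counter

# Hypothetical sample records shaped like the fields used above
# (weight['metric'], life_span, origin); the values are made up.
sample_breeds = [
    {"name": "Breed A", "origin": "Egypt", "weight": {"metric": "3 - 5"}, "life_span": "14 - 15"},
    {"name": "Breed B", "origin": "United States", "weight": {"metric": "4 - 6"}, "life_span": "12 - 16"},
    {"name": "Breed C", "origin": "United States", "weight": {"metric": "2 - 4"}, "life_span": "10 - 12"},
]


def range_midpoint(value):
    """Return the midpoint of a 'low - high' string, or None if it cannot be parsed."""
    try:
        bounds = [float(part) for part in value.split("-")]
    except (AttributeError, ValueError):
        return None
    return sum(bounds) / len(bounds)


def summarize(values):
    """Return min, max, mean, median and sample standard deviation."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std_dev": statistics.stdev(values),  # sample std dev, like pandas .std()
    }


weights = [m for m in (range_midpoint(b["weight"]["metric"]) for b in sample_breeds) if m is not None]
lifespans = [m for m in (range_midpoint(b["life_span"]) for b in sample_breeds) if m is not None]

print("Weight:", summarize(weights))
print("Lifespan:", summarize(lifespans))

# Number of breeds per country of origin
origin_counts = Counter(b["origin"] for b in sample_breeds)
for origin, count in origin_counts.most_common():
    print(origin, count)
```

Whether the midpoint, the lower bound, or the upper bound is the right summary of a range is a judgment call; the midpoint is only one option.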


3. Read the countries API and find:
the 10 largest countries
the 10 most spoken languages
the total number of languages in the countries API

In [9]:
import requests

# Fetch data from the Restcountries API
countries_api = 'https://restcountries.com/v3.1/all'
response = requests.get(countries_api)
countries_data = response.json()

# Task 1: Find the 10 largest countries
def get_area(country):
    area = country.get('area')
    if isinstance(area, dict):
        return area.get('total', 0)
    elif isinstance(area, (int, float)):
        return area
    else:
        return 0

largest_countries = sorted(countries_data, key=get_area, reverse=True)[:10]
print("10 Largest Countries:")
for country in largest_countries:
    print(country['name']['common'])

# Task 2: Find the 10 most spoken languages
# (ranked by the number of countries that list each language, not by number of speakers)
all_languages = [language for country in countries_data for language in country.get('languages', {}).values()]
top_languages = sorted(set(all_languages), key=all_languages.count, reverse=True)[:10]
print("\n10 Most Spoken Languages:")
for language in top_languages:
    print(language)

# Task 3: Find the total number of languages in the countries API
all_languages_set = set(all_languages)
total_languages = len(all_languages_set)
print("\nTotal Number of Languages in the Countries API:", total_languages)


10 Largest Countries:
Russia
Antarctica
Canada
China
United States
Brazil
Australia
India
Argentina
Kazakhstan

10 Most Spoken Languages:
English
French
Arabic
Spanish
Portuguese
Dutch
Russian
German
Chinese
Tswana

Total Number of Languages in the Countries API: 155
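
The ranking above calls `all_languages.count` once per distinct language, and ties are ordered by set iteration order, so the last entries can vary between runs. A minimal sketch of the same count using `collections.Counter` with an explicit tie-break on the language name. The sample records and the helper name `language_counts` are hypothetical; only the `languages` field layout is taken from the cell above:

```python
from collections import Counter

# Hypothetical sample shaped like the API records used above
sample_countries = [
    {"name": {"common": "Country A"}, "languages": {"eng": "English", "fra": "French"}},
    {"name": {"common": "Country B"}, "languages": {"eng": "English"}},
    {"name": {"common": "Country C"}},  # no 'languages' key at all
]


def language_counts(countries):
    """Count how many countries list each language."""
    return Counter(
        language
        for country in countries
        for language in country.get("languages", {}).values()
    )


counts = language_counts(sample_countries)

# Sort by count (descending), then by name so ties are ordered deterministically
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

for language, n_countries in ranked[:10]:
    print(f"{language}: {n_countries}")

print("Total number of languages:", len(counts))
```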


4. UCI is one of the most common places to get data sets for data science and machine learning. Read the content of UCI (https://archive.ics.uci.edu/ml/datasets.php). Without additional libraries it will be difficult, so you may try it with BeautifulSoup4.
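
For comparison, here is a rough sketch of what text extraction looks like without additional libraries, using the standard library's `html.parser.HTMLParser`. The class `TextExtractor`, the function `html_to_text`, and the sample HTML are hypothetical and only illustrate the idea; they are not part of the exercise:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping script and style content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind once tags are removed
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up sample standing in for response.text
sample_html = (
    "<html><head><title>Demo</title><style>p {}</style></head>"
    "<body><h1>Datasets</h1>\n\n<p>Iris   150 Instances</p></body></html>"
)
print(html_to_text(sample_html))
```

BeautifulSoup's `get_text()` replaces most of this boilerplate with a single call, which is why the exercise suggests it.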

In [5]:
import requests
from bs4 import BeautifulSoup

# URL of the UCI Machine Learning Repository
uci_url = 'https://archive.ics.uci.edu/'

# Send a GET request to the URL
response = requests.get(uci_url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract and print the text content of the page
    print(soup.get_text())
else:
    print(f"Failed to retrieve data. Status code: {response.status_code}")


UCI Machine Learning Repository

Home - UCI Machine Learning Repository




       Datasets Contribute Dataset Donate New Link External About Us Who We Are Citation Metadata Contact Information           Login  Welcome to the UC Irvine Machine Learning Repository We currently maintain 663 datasets as a service to the machine learning community.
          Here, you can donate and find datasets used by millions of people all around the world! View Datasets Contribute a Dataset Popular Datasets     Iris A small classic dataset from Fisher, 1936. One of the earliest known datasets used for evaluating classification methods.
  Classification  150 Instances  4 Features      Dry Bean Dataset Images of 13,611 grains of 7 different registered dry beans were taken with a high-resolution camera. A total of 16 features; 12 dimensions and 4 shape forms, were obtained from the grains.  Classification  13.61K Instances  16 Features      Rice (Cammeo and Osmancik) A total of 3810 rice grain