<a href="https://colab.research.google.com/github/chocograhams/IMT542/blob/main/i4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Example 1

**Information Structure**: CSV

**Access Technology**: Manual file upload and read locally in Colab

**Description:** Reads CSV data from a file (Repeat tags cleaned_SeaCoyottee - SME Highlights 2025-02-09 20_41.csv) uploaded to Google Colab and prints the 20 most frequent three-word phrases. It doubles as a simple text-mining script.

**Pros:**
*   CSV is a simple and widely used format for tabular data
*   Uploading files to Colab is straightforward
*   Reading a CSV needs no extra installation: pandas (used here) is pre-installed in Colab, and Python also ships a built-in 'csv' module
*   Allows working with data you might have locally

**Cons:**
*   Requires the user to manually upload the file each session
*   Data is static unless the uploaded file is changed
*   No inherent complex structure beyond rows and columns


In [40]:

import pandas as pd
from collections import Counter

def find_top_phrases(filename="/content/Repeat tags cleaned_SeaCoyottee - SME Highlights 2025-02-09 20_41.csv", top_n=20):
    try:
        df = pd.read_csv(filename)  # Expects a 'Text' column holding the text data

        if 'Text' not in df.columns:
            print("Error: 'Text' column not found in the CSV. Please ensure the CSV has a column named 'Text' containing the text data.")
            return

        all_text = ' '.join(df['Text'].astype(str).tolist()).lower() # Combine all text and lowercase

        words = all_text.split()
        trigrams = []  # Every consecutive 3-word sequence, regardless of part of speech

        # Basic phrase extraction: slide a 3-word window across the text
        for i in range(len(words) - 2):
            trigrams.append(tuple(words[i:i+3]))

        phrase_counts = Counter(trigrams)

        # Print top N phrases
        print(f"Top {top_n} 3-word phrases:")
        for phrase, count in phrase_counts.most_common(top_n):
            print(f"{phrase}: {count}")

    except FileNotFoundError:
        print(f"Error: The file '{filename}' was not found.")
    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    find_top_phrases()

Top 20 3-word phrases:
('a', 'lot', 'of'): 20
('i', 'mean,', 'i'): 12
('a', 'little', 'bit'): 8
('the', 'city', 'of'): 7
('lot', 'of', 'people'): 7
('i', "don't", 'know'): 6
('in', 'their', 'neighborhood'): 6
('there', 'was', 'a'): 6
('little', 'bit', 'more'): 5
('things', 'like', 'that.'): 5
('yeah,', 'i', 'think'): 5
('you', 'never', 'know'): 5
('but', 'i', "don't"): 5
('you', 'know,', 'i'): 4
('i', "don't", 'have'): 4
('i', 'would', 'just'): 4
('just', 'like', 'a'): 4
('city', 'of', 'seattle'): 4
('i', 'think', 'those'): 4
('of', 'people', "don't"): 4
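
The counts above treat `'mean,'` and `'that.'` as different words from `'mean'` and `'that'`, because the text is split on whitespace only. Below is a minimal sketch of one way to normalize tokens before counting, assuming it is acceptable to drop all punctuation except straight apostrophes. The helper name `top_trigrams` and the sample sentence are hypothetical and not part of the notebook.

```python
import re
from collections import Counter


def top_trigrams(text, top_n=20):
    """Count 3-word phrases after lowercasing and stripping punctuation."""
    # Keep letters, digits, and apostrophes so contractions like "don't" survive.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Zipping three offset views of the list yields every consecutive triple.
    trigrams = zip(words, words[1:], words[2:])
    return Counter(trigrams).most_common(top_n)


sample = "I mean, I don't know. I mean I don't know!"
for phrase, count in top_trigrams(sample, top_n=3):
    print(" ".join(phrase), count)
```

With this normalization, "I mean, I" and "I mean I" are counted as the same phrase. Curly apostrophes and non-ASCII letters are not matched by this pattern and would need a broader character class.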


# Example 2

**Information Structure**: JSON

**Access Technology**: API connection over HTTP

**Description**: Fetches JSON data from a public API (api.thecatapi.com) and prints the decoded response.

**Pros:**
*   Many public APIs offer free and easily accessible data
*   The 'requests' library is often pre-installed in Colab
*   JSON is a common and relatively easy-to-understand data format
*   No local file management needed initially

**Cons**:
*   Requires an internet connection
*   API availability and structure can change
*   Some APIs might have rate limits
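
Rate limits and temporary outages, listed above, are commonly handled by retrying with an increasing delay. Below is a minimal sketch of that idea, assuming a hypothetical `fetch` callable that raises an exception on failure. `call_with_backoff`, its parameters, and the `flaky` stand-in are illustrative names, not part of `requests` or of this notebook.

```python
import time


def call_with_backoff(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            # Give up and re-raise once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Stand-in for a real request: fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return {"status": "ok"}


# Record the delays instead of sleeping so the demo runs instantly.
delays = []
result = call_with_backoff(flaky, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

In real use, `fetch` could wrap a `requests.get(...)` call followed by `raise_for_status()`, and the `except` clause would be better narrowed to `requests.exceptions.RequestException`.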

In [13]:
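import requests

def get_and_print_json_from_api():
    """Request the image-search endpoint and print the decoded JSON."""
    api_url = "https://api.thecatapi.com/v1/images/search"
    try:
        response = requests.get(api_url, timeout=10)  # Time out rather than hang indefinitely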
        response.raise_for_status()  # Raise an exception for bad status codes
        json_data = response.json()
        print("JSON data from API:")
        print(json_data)
    except requests.exceptions.RequestException as e:
        print(f"Error fetching data from API: {e}")

if __name__ == "__main__":
    get_and_print_json_from_api()

JSON data from API:
[{'id': 'cp3', 'url': 'https://cdn2.thecatapi.com/images/cp3.jpg', 'width': 640, 'height': 480}]
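
The response shown above is a JSON array containing a single object. Below is a short sketch of pulling individual fields out of that structure, using the printed sample as a hard-coded string so it runs without a network call. The field names come from the output above; `describe_images` is a hypothetical helper.

```python
import json

# Sample payload copied from the output above, written as JSON text.
sample_text = (
    '[{"id": "cp3", "url": "https://cdn2.thecatapi.com/images/cp3.jpg", '
    '"width": 640, "height": 480}]'
)


def describe_images(payload):
    """Return one summary line per image object in the decoded JSON list."""
    lines = []
    for item in payload:
        # .get() avoids a KeyError if a field is missing from some responses.
        lines.append(
            f"{item.get('id')}: {item.get('width')}x{item.get('height')} "
            f"{item.get('url')}"
        )
    return lines


images = json.loads(sample_text)  # list of dicts, same shape as response.json()
for line in describe_images(images):
    print(line)
```

Because `response.json()` already returns Python lists and dicts, the same helper could be applied to it directly.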


# Example 3
**Information Structure**: HTML

**Access Technology**: HTTP to download webpage (using requests) and basic parsing

**Description**: Downloads an HTML webpage over HTTP and prints its title and the text of its first three paragraphs.

**Pros**:
*   Demonstrates fetching data from the internet beyond structured APIs
*   'requests' is often pre-installed in Colab

**Cons**:
*   Requires an internet connection
*   HTML structure variability can make consistent data extraction challenging
*   Even this simple extraction requires understanding basic HTML tags
*   Requires the 'beautifulsoup4' package, which may need to be installed first
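
If installing 'beautifulsoup4' is not an option, Python's built-in `html.parser` module can cover simple cases like this one. Below is a minimal sketch under that assumption. The class name `TitleAndParagraphs` and the sample HTML are hypothetical, and the parser makes no attempt to cope with unclosed or nested `<p>` tags the way BeautifulSoup does.

```python
from html.parser import HTMLParser


class TitleAndParagraphs(HTMLParser):
    """Collect the <title> text and the text of the first few <p> elements."""

    def __init__(self, max_paragraphs=3):
        super().__init__()
        self.max_paragraphs = max_paragraphs
        self.title = ""
        self.paragraphs = []
        self._current = None  # tag whose text is being collected, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        wanted = tag == "title" or (
            tag == "p" and len(self.paragraphs) < self.max_paragraphs
        )
        if wanted and self._current is None:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        # Text inside inline children such as <b> is collected too.
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        text = "".join(self._buffer).strip()
        if tag == "title":
            self.title = text
        else:
            self.paragraphs.append(text)
        self._current = None


sample_html = (
    "<html><head><title> Demo page </title></head><body>"
    "<p>First <b>bold</b> paragraph.</p><p>Second.</p></body></html>"
)
parser = TitleAndParagraphs()
parser.feed(sample_html)
print(parser.title)
print(parser.paragraphs)
```

To use it on a live page, the text of a downloaded response could be passed to `parser.feed(...)` in place of `sample_html`.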


In [44]:
import requests
from bs4 import BeautifulSoup

def get_and_print_multiple_paragraphs(url="https://www.gutenberg.org/about/background/50years.html"):

    try:
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')

        # Extract the title
        title_tag = soup.find('title')
        title = title_tag.string.strip() if title_tag and title_tag.string else "No title found"
        print(f"\nDetails from webpage at {url}:")
        print(f"Title: {title}")

        # Extract the first three paragraphs
        paragraph_tags = soup.find_all('p', limit=3)  # Find the first 3 <p> tags

        if paragraph_tags:
            for i, p_tag in enumerate(paragraph_tags):
                paragraph_text = p_tag.text.strip()
                print(f"Paragraph {i+1}: {paragraph_text}")

            # Report when the page has fewer paragraphs than requested
            if len(paragraph_tags) < 3:
                print(f"Only {len(paragraph_tags)} paragraph tag(s) found on this webpage.")
        else:
            print("No paragraph tags found on this webpage.")

    except requests.exceptions.RequestException as e:
        print(f"Error fetching webpage at {url}: {e}")
    except Exception as e:
        print(f"An error occurred while processing the webpage: {e}")

if __name__ == "__main__":
    # Make sure you have installed beautifulsoup4 in Colab:
    # !pip install beautifulsoup4
    get_and_print_multiple_paragraphs()


Details from webpage at https://www.gutenberg.org/about/background/50years.html:
Title: 50 years of eBooks 1971-2021 | Project Gutenberg
Paragraph 1: Ways to donate
Paragraph 2: July 4, 2021 – In 1997, Time-Life magazine picked the movable type printing press as the most important invention of the second millennium. Like most important innovations and social changes, the printing press was an evolution that had deep roots in history.
Paragraph 3: Move forward in time to 1971, when Michael Hart invented the eBook. Like Gutenberg’s printing press, Hart’s innovation followed decades of prior work. To name a few, this includes Vannevar Bush’s “Memex” (1930s, based on microfiche), Bob Brown’s “The Readies” (1930s), Brown University’s “FRESS” (1960s), Ted Nelson’s Xanadu (1960s), and many others.
