
In [3]:
# Uncomment if xmltodict is not installed:
# %pip install xmltodict

from pprint import pprint
import csv
from io import StringIO
import xmltodict



In [20]:
# %% [markdown]
# ## 2. Simulated Transit Arrivals (Bus API)
# Real-time transit APIs often provide structured JSON data that includes
# lists of arriving buses, their estimated arrival times, and status.
# Here, we simulate 10,000 arrival entries and analyze them using pandas to
# summarize status counts, the average wait time, and route frequency.
# This models a realistic public transportation monitoring application.

# %%
import pandas as pd
import random

# Setup possible values
bus_routes = ['Midtown Loop', 'Uptown Express', 'Suburban Shuttle', 'Airport Connector', 'City Circle']
status_pool = ['On Time', 'Delayed', 'Cancelled']

# Generate 10,000 fake bus arrival records
bus_arrival_data = []
for i in range(10000):
    bus = {
        'bus_number': f"{random.randint(1, 99)}{random.choice(['A', 'B', 'C'])}",
        'route': random.choice(bus_routes),
        'arrival_in_min': random.randint(1, 30),
        'status': random.choices(status_pool, weights=[0.7, 0.25, 0.05])[0]
    }
    bus_arrival_data.append(bus)

# Load into DataFrame
bus_df = pd.DataFrame(bus_arrival_data)

# === Data Analysis ===
print("\n=== Sample Bus Arrivals ===")
print(bus_df.head())

print("\n=== Status Counts ===")
print(bus_df['status'].value_counts())

print("\nAverage Arrival Time (min):", bus_df['arrival_in_min'].mean())

print("\n=== Top Routes ===")
print(bus_df['route'].value_counts())



=== Sample Bus Arrivals ===
  bus_number              route  arrival_in_min   status
0        30A   Suburban Shuttle              13  Delayed
1        39C        City Circle              23  On Time
2        77C     Uptown Express              17  On Time
3        31B  Airport Connector              12  On Time
4        96C        City Circle               1  Delayed

=== Status Counts ===
status
On Time      7032
Delayed      2484
Cancelled     484
Name: count, dtype: int64

Average Arrival Time (min): 15.407

=== Top Routes ===
route
City Circle          2051
Midtown Loop         2009
Airport Connector    1992
Uptown Express       1977
Suburban Shuttle     1971
Name: count, dtype: int64
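
The records above are built directly as Python dicts, whereas a real transit API would deliver them as JSON text. The following is a minimal stdlib-only sketch of that decoding step, followed by a delayed percentage and a per-route average wait. The `arrivals` wrapper key and the four sample records are assumptions for illustration, not the schema of any real transit API.

```python
import json
from collections import Counter

# Hypothetical payload shaped like the simulated records above;
# the "arrivals" wrapper key is an assumption, not a real API's schema.
payload = """
{
  "arrivals": [
    {"bus_number": "30A", "route": "Suburban Shuttle", "arrival_in_min": 13, "status": "Delayed"},
    {"bus_number": "39C", "route": "City Circle", "arrival_in_min": 23, "status": "On Time"},
    {"bus_number": "77C", "route": "Uptown Express", "arrival_in_min": 17, "status": "On Time"},
    {"bus_number": "96C", "route": "City Circle", "arrival_in_min": 1, "status": "Delayed"}
  ]
}
"""

# Decode the JSON text into Python objects
arrivals = json.loads(payload)["arrivals"]

# Share of arrivals flagged as delayed
status_counts = Counter(bus["status"] for bus in arrivals)
delayed_pct = 100 * status_counts["Delayed"] / len(arrivals)

# Average wait per route
waits_by_route = {}
for bus in arrivals:
    waits_by_route.setdefault(bus["route"], []).append(bus["arrival_in_min"])
avg_wait_by_route = {route: sum(w) / len(w) for route, w in waits_by_route.items()}

print(f"Delayed: {delayed_pct:.1f}%")
print(avg_wait_by_route)
```

The same list of dicts can be passed to `pd.DataFrame(arrivals)` to continue with the pandas analysis used in this section.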


In [21]:
# %% [markdown]
# ## 3. Simulated CSV Schedule (University Courses)
# Many older APIs or internal systems return data in CSV format rather than JSON or XML.
# In this section, we simulate a university’s course scheduling system that exports CSV data
# about courses (with the department encoded in the course code), professors, scheduled times, and credits.
# We generate 10,000 course records and then parse them using Python’s `csv` module and
# `pandas` for analysis.
# This section teaches how to work with tabular data exported from a system,
# including filtering, aggregation, and cleanup.

# %%
import csv
from io import StringIO
import pandas as pd
import random

departments = ['CS', 'MATH', 'PHYS', 'HIST', 'BIO', 'ENG', 'ECON']
professors = ['Dr. Wu', 'Dr. Singh', 'Dr. Allen', 'Dr. Kim', 'Dr. Zhao', 'Dr. Thomas']
times = ['8:00AM', '9:30AM', '11:00AM', '1:00PM', '2:30PM', '4:00PM']

# Generate 10,000 rows of fake CSV data
csv_lines = ["Course,Professor,Time,Credits"]
for i in range(10000):
    course_code = f"{random.choice(departments)}{random.randint(100, 499)}"
    line = f"{course_code},{random.choice(professors)},{random.choice(times)},{random.choice([3, 4])}"
    csv_lines.append(line)

# Combine lines into one CSV string
csv_data = "\n".join(csv_lines)

# Use StringIO to simulate reading from a CSV file
f = StringIO(csv_data)
reader = csv.reader(f)
rows = list(reader)

# Print the header row plus the first 3 data rows
print("=== Preview CSV Rows ===")
for row in rows[:4]:
    print(row)

# Load into DataFrame for analysis
df_courses = pd.DataFrame(rows[1:], columns=rows[0])

# Convert credits to numeric
df_courses['Credits'] = df_courses['Credits'].astype(int)

# === Analysis ===
print("\n=== Most Common Courses ===")
print(df_courses['Course'].value_counts().head())

print("\n=== Credit Distribution ===")
print(df_courses['Credits'].value_counts())

print("\n=== Courses by Professor ===")
print(df_courses['Professor'].value_counts().head())

print("\n=== Scheduled Times ===")
print(df_courses['Time'].value_counts())


=== Preview CSV Rows ===
['Course', 'Professor', 'Time', 'Credits']
['PHYS445', 'Dr. Thomas', '11:00AM', '3']
['ECON393', 'Dr. Wu', '8:00AM', '3']
['ENG254', 'Dr. Zhao', '8:00AM', '4']

=== Most Common Courses ===
Course
CS377      13
PHYS470    13
PHYS496    11
ECON238    11
MATH128    11
Name: count, dtype: int64

=== Credit Distribution ===
Credits
3    5014
4    4986
Name: count, dtype: int64

=== Courses by Professor ===
Professor
Dr. Allen     1745
Dr. Kim       1723
Dr. Thomas    1669
Dr. Wu        1638
Dr. Zhao      1626
Name: count, dtype: int64

=== Scheduled Times ===
Time
2:30PM     1726
8:00AM     1701
11:00AM    1700
4:00PM     1655
1:00PM     1618
9:30AM     1600
Name: count, dtype: int64
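
The filtering, aggregation, and cleanup steps can also be sketched with the standard library alone. This example assumes a four-row CSV in the same column layout as above; the `department_of` helper and the `sample_rows` name are hypothetical and introduced only for illustration.

```python
import csv
import re
from collections import Counter
from io import StringIO

# Small sample in the same column layout as the generated CSV above
csv_text = """Course,Professor,Time,Credits
PHYS445,Dr. Thomas,11:00AM,3
ECON393,Dr. Wu,8:00AM,3
ENG254,Dr. Zhao,8:00AM,4
PHYS470,Dr. Kim,2:30PM,4
"""

# DictReader maps each row to the header names instead of positions
sample_rows = list(csv.DictReader(StringIO(csv_text)))

def department_of(course_code):
    """Return the leading letters of a course code, e.g. 'PHYS445' -> 'PHYS'."""
    match = re.match(r"[A-Za-z]+", course_code)
    return match.group(0) if match else ""

# Cleanup: convert credits to int and derive the department column
for row in sample_rows:
    row["Credits"] = int(row["Credits"])
    row["Department"] = department_of(row["Course"])

# Filtering: courses scheduled at 8:00AM
courses_at_8am = [row["Course"] for row in sample_rows if row["Time"] == "8:00AM"]

# Aggregation: total credits offered per department
credits_by_dept = Counter()
for row in sample_rows:
    credits_by_dept[row["Department"]] += row["Credits"]

print(courses_at_8am)
print(dict(credits_by_dept))
```

Reading by header name keeps the code working if the exporting system reorders its columns, which positional indexing into `csv.reader` rows would not.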


In [22]:
# %% [markdown]
# ## 4. Simulated GeoJSON (Zoo Areas / City Blocks)
# Geo-based APIs such as OpenStreetMap or municipal datasets often provide boundaries as GeoJSON,
# which includes coordinates that define the borders of regions such as parks, neighborhoods, or zoo enclosures.
# In this section, we simulate 10,000 polygon boundaries — each representing a rectangular “zone”
# (like a city block or animal enclosure).
# This teaches students how polygon coordinates are structured, how to parse GeoJSON-style responses,
# and how to extract bounding shapes from geospatial APIs.

# %%
import random

# Generate 10,000 mock polygon zones
geojson_data = []
for i in range(10000):
    lat_base = 39.90 + random.random() * 0.1  # Latitude around Philly
    lon_base = -75.20 + random.random() * 0.1  # Longitude around Philly

    polygon = {
        'id': i,
        'zone_name': f"Zone_{i}",
        'geojson': {
            'type': 'Polygon',
            'coordinates': [[
                [lon_base, lat_base],
                [lon_base + 0.001, lat_base],
                [lon_base + 0.001, lat_base + 0.001],
                [lon_base, lat_base + 0.001],
                [lon_base, lat_base]  # closing loop
            ]]
        }
    }
    geojson_data.append(polygon)

# View first 2 polygon shapes
from pprint import pprint
print("=== Sample GeoJSON Zones ===")
pprint(geojson_data[:2])

# Flatten for DataFrame analysis
import pandas as pd
flat_geo = []
for zone in geojson_data:
    coords = zone['geojson']['coordinates'][0]
    flat_geo.append({
        'id': zone['id'],
        'zone_name': zone['zone_name'],
        'min_lat': min(c[1] for c in coords),
        'max_lat': max(c[1] for c in coords),
        'min_lon': min(c[0] for c in coords),
        'max_lon': max(c[0] for c in coords)
    })

geo_df = pd.DataFrame(flat_geo)

print("\n=== Bounding Box Summary ===")
print(geo_df.describe())

print("\nTotal unique zones:", geo_df['zone_name'].nunique())


=== Sample GeoJSON Zones ===
[{'geojson': {'coordinates': [[[-75.18838001419057, 39.90937868345754],
                               [-75.18738001419057, 39.90937868345754],
                               [-75.18738001419057, 39.91037868345754],
                               [-75.18838001419057, 39.91037868345754],
                               [-75.18838001419057, 39.90937868345754]]],
              'type': 'Polygon'},
  'id': 0,
  'zone_name': 'Zone_0'},
 {'geojson': {'coordinates': [[[-75.14232370380229, 39.993131274509416],
                               [-75.14132370380229, 39.993131274509416],
                               [-75.14132370380229, 39.994131274509414],
                               [-75.14232370380229, 39.994131274509414],
                               [-75.14232370380229, 39.993131274509416]]],
              'type': 'Polygon'},
  'id': 1,
  'zone_name': 'Zone_1'}]

=== Bounding Box Summary ===
                id       min_lat       max_lat       min_lon       max
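
GeoJSON stores each position as `[longitude, latitude]`, which is why the flattening code above reads `c[0]` as longitude and `c[1]` as latitude. The following is a minimal sketch of one zone wrapped as a GeoJSON `Feature`, round-tripped through JSON text, and reduced to a bounding box. The zone name, the coordinates, and the `bounding_box` and `bbox_contains` helpers are assumptions made for illustration.

```python
import json

# One rectangular zone as a GeoJSON Feature.
# Positions are [longitude, latitude]; the values are made up for this sketch.
feature = {
    "type": "Feature",
    "properties": {"zone_name": "Zone_demo"},  # hypothetical name
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [-75.20, 39.90],
            [-75.19, 39.90],
            [-75.19, 39.91],
            [-75.20, 39.91],
            [-75.20, 39.90],  # first position repeated to close the ring
        ]],
    },
}

def bounding_box(geometry):
    """Return (min_lon, min_lat, max_lon, max_lat) for a Polygon's outer ring."""
    ring = geometry["coordinates"][0]
    lons = [pt[0] for pt in ring]
    lats = [pt[1] for pt in ring]
    return min(lons), min(lats), max(lons), max(lats)

def bbox_contains(bbox, lon, lat):
    """Check whether a point lies inside (or on the edge of) a bounding box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

# Round-trip through text, as an API response would arrive
decoded = json.loads(json.dumps(feature))
bbox = bounding_box(decoded["geometry"])

print(bbox)
print(bbox_contains(bbox, -75.195, 39.905))
print(bbox_contains(bbox, -75.0, 39.905))
```

A bounding-box test is only exact for axis-aligned rectangles like these zones; irregular polygons need a proper point-in-polygon check.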

In [23]:
# %% [markdown]
# ## 5. Simulated XML Feed (News Articles)
# Some APIs — especially older ones, RSS feeds, or enterprise services — return structured text in XML format.
# XML is hierarchical and tag-based, like HTML. This section simulates an XML news article feed with 10,000 articles,
# each including a `headline`, `author`, and `content` tag.
# We parse the XML using the `xmltodict` library, extract the text from each field, and perform basic text analysis.
# This teaches students how to work with semi-structured formats and introduces parsing techniques
# for systems where JSON is not available.

# %%
import xmltodict
import random
import pandas as pd

# Fake authors and headline topics
authors = ['Alice Rivera', 'Michael Lee', 'Jenna Patel', 'Thomas Zhang', 'Maya Brooks']
topics = ['Clean Energy', 'Elections', 'AI Regulation', 'Mars Mission', 'Education Reform']

# Generate 10,000 fake XML articles
xml_articles = []
for i in range(10000):
    xml_string = f"""
    <article>
        <headline>{random.choice(topics)} Bill #{i}</headline>
        <author>{random.choice(authors)}</author>
        <content>This is the full body text of article #{i}, detailing updates about {random.choice(topics)}.</content>
    </article>
    """
    xml_articles.append(xml_string)

# Parse first 5 articles into Python dicts
parsed_articles = []
for x in xml_articles[:5]:
    parsed = xmltodict.parse(x)
    article = parsed['article']
    parsed_articles.append(article)

print("=== Sample Parsed Articles ===")
from pprint import pprint
pprint(parsed_articles)

# Load into DataFrame for further analysis
article_df = pd.DataFrame(parsed_articles)
print("\n=== Sample DataFrame ===")
print(article_df.head())

# Analyze most common authors
print("\n=== Author Frequency ===")
print(article_df['author'].value_counts())

# Analyze word counts in content
article_df['word_count'] = article_df['content'].str.split().apply(len)
print("\nAverage word count:", article_df['word_count'].mean())


=== Sample Parsed Articles ===
[{'author': 'Michael Lee',
  'content': 'This is the full body text of article #0, detailing updates '
             'about Education Reform.',
  'headline': 'Education Reform Bill #0'},
 {'author': 'Alice Rivera',
  'content': 'This is the full body text of article #1, detailing updates '
             'about AI Regulation.',
  'headline': 'Education Reform Bill #1'},
 {'author': 'Thomas Zhang',
  'content': 'This is the full body text of article #2, detailing updates '
             'about Education Reform.',
  'headline': 'Elections Bill #2'},
 {'author': 'Alice Rivera',
  'content': 'This is the full body text of article #3, detailing updates '
             'about Mars Mission.',
  'headline': 'Elections Bill #3'},
 {'author': 'Maya Brooks',
  'content': 'This is the full body text of article #4, detailing updates '
             'about Clean Energy.',
  'headline': 'Education Reform Bill #4'}]

=== Sample DataFrame ===
                   headline        
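
Real feeds usually wrap many articles in a single root element rather than sending one XML document per article. As an alternative to `xmltodict`, the sketch below uses the standard library's `xml.etree.ElementTree` to parse such a feed. The `<feed>` wrapper element and the two sample articles are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: a <feed> root wrapping articles shaped like the ones above
feed_xml = """
<feed>
    <article>
        <headline>Clean Energy Bill #0</headline>
        <author>Alice Rivera</author>
        <content>This is the full body text of article #0.</content>
    </article>
    <article>
        <headline>Elections Bill #1</headline>
        <author>Michael Lee</author>
        <content>This is the full body text of article #1, with more words.</content>
    </article>
</feed>
"""

# Parse the text into an element tree; strip() removes the surrounding blank lines
root = ET.fromstring(feed_xml.strip())

# Convert each <article> into a dict keyed by child tag name
articles = [
    {child.tag: (child.text or "").strip() for child in article}
    for article in root.findall("article")
]

# Basic text analysis: whitespace-separated word count per article
word_counts = [len(a["content"].split()) for a in articles]

for article, count in zip(articles, word_counts):
    print(article["author"], "|", article["headline"], "|", count, "words")
```

The resulting list of dicts has the same shape as `parsed_articles`, so it can be loaded with `pd.DataFrame(articles)` in the same way.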

In [24]:
# %% [markdown]
# ## 6. Simulated Sports Stats (Basketball Match Data)
# APIs that provide sports data often deliver complex, nested JSON structures with deeply connected entities:
# teams, players, scores, and events. Services like Sportradar expose this data through authentication-based APIs.
# In this simulation, we generate 10,000 basketball game summaries, including team names, home/away roles, final scores,
# and top scorers. The exercise demonstrates parsing nested JSON objects, calculating team performance metrics,
# and identifying trends (e.g., average point difference or high scorers).
# This is foundational for sports analytics dashboards and event stream processing.

# %%
import pandas as pd
import random

teams = ['Lions', 'Tigers', 'Sharks', 'Wolves', 'Dragons', 'Falcons']
players = ['Jordan Wells', 'Avery Black', 'Riley Chen', 'Chris Strong', 'Skylar Moore']

# Generate 10,000 game records
games = []
for i in range(10000):
    home_team = random.choice(teams)
    away_team = random.choice([t for t in teams if t != home_team])
    home_score = random.randint(70, 130)
    away_score = random.randint(70, 130)
    top_scorer = random.choice(players)
    top_points = random.randint(20, 45)

    game = {
        'game_id': f"game_{i}",
        'home_team': home_team,
        'away_team': away_team,
        'home_score': home_score,
        'away_score': away_score,
        # Note: a tied score is credited to the away team by this expression
        'winner': home_team if home_score > away_score else away_team,
        'top_scorer': top_scorer,
        'points_scored': top_points
    }
    games.append(game)

# Load into DataFrame
game_df = pd.DataFrame(games)

print("=== Sample Games ===")
print(game_df.head())

# Calculate average point difference
game_df['point_diff'] = abs(game_df['home_score'] - game_df['away_score'])
print("\nAverage Point Difference:", game_df['point_diff'].mean())

# Count team wins
print("\n=== Team Wins ===")
print(game_df['winner'].value_counts())

# Top scorers by frequency
print("\n=== Frequent Top Scorers ===")
print(game_df['top_scorer'].value_counts())

# Highest individual scoring performance
max_points = game_df['points_scored'].max()
print("\n=== Max Points in a Game:", max_points)
print(game_df[game_df['points_scored'] == max_points])


=== Sample Games ===
  game_id home_team away_team  home_score  away_score   winner    top_scorer  \
0  game_0     Lions   Dragons         114         129  Dragons   Avery Black   
1  game_1    Sharks   Dragons          72          99  Dragons  Chris Strong   
2  game_2   Falcons     Lions         113         100  Falcons  Jordan Wells   
3  game_3   Dragons    Tigers         104          76  Dragons  Skylar Moore   
4  game_4   Falcons     Lions         119         130    Lions   Avery Black   

   points_scored  
0             21  
1             36  
2             22  
3             20  
4             29  

Average Point Difference: 20.2475

=== Team Wins ===
winner
Falcons    1709
Sharks     1703
Lions      1693
Dragons    1674
Wolves     1620
Tigers     1601
Name: count, dtype: int64

=== Frequent Top Scorers ===
top_scorer
Jordan Wells    2051
Avery Black     2016
Riley Chen      1991
Skylar Moore    1975
Chris Strong    1967
Name: count, dtype: int64

=== Max Points in a Game: 45
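
The generated records above are flat, while sports APIs typically nest teams and players inside each game object. The following is a minimal sketch of flattening such a nested payload into rows. The key names (`teams`, `home`, `away`, `top_scorer`) are assumptions for illustration and do not reflect any vendor's schema. Unlike the `winner` expression in the generator above, this version records no winner when the scores are tied.

```python
import json

# Hypothetical nested payload; the key names are assumptions for
# illustration, not a real provider's schema.
nested_games_json = """
[
  {"game_id": "game_a",
   "teams": {"home": {"name": "Lions", "score": 114},
             "away": {"name": "Dragons", "score": 129}},
   "top_scorer": {"name": "Avery Black", "points": 21}},
  {"game_id": "game_b",
   "teams": {"home": {"name": "Sharks", "score": 101},
             "away": {"name": "Wolves", "score": 101}},
   "top_scorer": {"name": "Riley Chen", "points": 36}}
]
"""

def flatten_game(game):
    """Turn one nested game object into a flat row."""
    home = game["teams"]["home"]
    away = game["teams"]["away"]
    if home["score"] > away["score"]:
        winner = home["name"]
    elif away["score"] > home["score"]:
        winner = away["name"]
    else:
        winner = None  # tie: no winner recorded
    return {
        "game_id": game["game_id"],
        "home_team": home["name"],
        "away_team": away["name"],
        "point_diff": abs(home["score"] - away["score"]),
        "winner": winner,
        "top_scorer": game["top_scorer"]["name"],
        "points_scored": game["top_scorer"]["points"],
    }

flat_games = [flatten_game(g) for g in json.loads(nested_games_json)]
for row in flat_games:
    print(row)
```

Each flat row uses the same column names as `game_df`, so `pd.DataFrame(flat_games)` would slot into the analysis in this section.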

In [25]:
# %% [markdown]
# ## 7. Simulated Social Feed (Lab Journal Updates)
# APIs like Twitter (or X) return chronological, short-form updates in a structured JSON format. These data streams are
# excellent for time-series analysis, event monitoring, or content filtering. In this simulation, we generate 10,000
# lab journal entries formatted like tweets — each with a timestamp and brief update text.
# The focus here is on understanding how social/post feed APIs deliver real-time or recent items, and how to:
# - Parse timestamped records
# - Filter by keywords
# - Count frequency of posts over time
# This prepares students to work with social APIs, chat logs, or any time-stamped communications platform.

# %%
import pandas as pd
import random
from datetime import datetime, timedelta

# Setup
topics = ['protein folding', 'paper submission', 'experimental alignment', 'data cleanup', 'model training']
people = ['Dr. Lee', 'Alex', 'Jordan', 'Casey', 'Ravi']

# Generate timestamps over the past 90 days
base_time = datetime.now()
timestamps = [base_time - timedelta(minutes=random.randint(0, 60*24*90)) for _ in range(10000)]

# Generate 10,000 fake lab feed posts
lab_feed = []
for i in range(10000):
    post = {
        'timestamp': timestamps[i],
        'text': f"{random.choice(people)} updated: {random.choice(topics)} progress at checkpoint {random.randint(1, 10)}."
    }
    lab_feed.append(post)

# Convert to DataFrame
lab_df = pd.DataFrame(lab_feed)
lab_df['date'] = lab_df['timestamp'].dt.date
lab_df['hour'] = lab_df['timestamp'].dt.hour

print("=== Sample Lab Posts ===")
print(lab_df.head())

# Filter posts mentioning 'protein folding' (literal match, no regex)
protein_posts = lab_df[lab_df['text'].str.contains('protein folding', regex=False)]
print("\nTotal posts mentioning 'protein folding':", len(protein_posts))

# Posts per day
print("\n=== Posts per Day ===")
print(lab_df['date'].value_counts().sort_index().tail(10))

# Posts by hour of day (sorted by hour, not by count)
print("\n=== Most Active Posting Hours ===")
print(lab_df['hour'].value_counts().sort_index())

# Word count stats
lab_df['word_count'] = lab_df['text'].str.split().apply(len)
print("\nAverage words per post:", lab_df['word_count'].mean())


=== Sample Lab Posts ===
                   timestamp  \
0 2025-08-14 15:39:04.184733   
1 2025-08-23 02:45:04.184733   
2 2025-09-25 05:21:04.184733   
3 2025-09-04 07:29:04.184733   
4 2025-08-04 15:46:04.184733   

                                                text        date  hour  
0  Ravi updated: experimental alignment progress ...  2025-08-14    15  
1  Ravi updated: paper submission progress at che...  2025-08-23     2  
2  Casey updated: paper submission progress at ch...  2025-09-25     5  
3  Ravi updated: experimental alignment progress ...  2025-09-04     7  
4  Casey updated: data cleanup progress at checkp...  2025-08-04    15  

Total posts mentioning 'protein folding': 2026

=== Posts per Day ===
date
2025-09-28    111
2025-09-29    122
2025-09-30    110
2025-10-01    102
2025-10-02    117
2025-10-03     96
2025-10-04    106
2025-10-05    118
2025-10-06    104
2025-10-07     77
Name: count, dtype: int64

=== Most Active Posting Hours ===
hour
0     432
1     424
2 
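
Feed APIs deliver timestamps as strings, so the first step in practice is parsing them. The simulation above skips this because it creates `datetime` objects directly. The following is a minimal stdlib-only sketch covering the three tasks listed in this section: parsing timestamped records, filtering by keyword, and counting posts over time. The `created_at` field name and the three sample posts are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical feed items with ISO 8601 timestamp strings
raw_posts = [
    {"created_at": "2025-10-06T09:15:00",
     "text": "Alex updated: protein folding progress at checkpoint 3."},
    {"created_at": "2025-10-06T17:40:00",
     "text": "Ravi updated: data cleanup progress at checkpoint 1."},
    {"created_at": "2025-10-07T09:05:00",
     "text": "Casey updated: protein folding progress at checkpoint 7."},
]

# Parse timestamped records
for post in raw_posts:
    post["timestamp"] = datetime.fromisoformat(post["created_at"])

# Filter by keyword (case-insensitive substring match)
keyword_hits = [p for p in raw_posts if "protein folding" in p["text"].lower()]

# Count frequency of posts over time
posts_per_day = Counter(p["timestamp"].date().isoformat() for p in raw_posts)
posts_per_hour = Counter(p["timestamp"].hour for p in raw_posts)

# most_common ranks hours by count rather than by clock order
busiest_hour, busiest_count = posts_per_hour.most_common(1)[0]

print(len(keyword_hits), "posts mention protein folding")
print(dict(posts_per_day))
print(f"Busiest hour: {busiest_hour}:00 with {busiest_count} posts")
```

In pandas, the equivalent parsing step is `pd.to_datetime` on the string column, after which the `.dt` accessors used above apply.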

In [26]:
# %% [markdown]
# ## 8. Simulated Yelp API (Pizza Finder Business Search)
# Local business APIs like Yelp and Google Places allow developers to query businesses by filters such as location,
# rating, price, and category. These APIs return structured JSON that includes fields like `name`, `rating`,
# `distance`, `price`, and contact details such as `phone`.
# In this simulation, we generate 10,000 pizza shop records and show how to:
# - Filter businesses by rating or price
# - Sort results by distance (nearest)
# - Analyze popular price points and average ratings
# This mimics how ride-sharing, delivery, and restaurant apps use real-time location-aware business data.

# %%
import pandas as pd
import random

# Set up categories
names = ['Tony’s Pizza', 'Mama’s Slice', 'Pizza Palace', 'Big Al’s Pies', 'Cheesy Bites', 'Neapolitan Express']
price_levels = ['$', '$$', '$$$']
area_codes = ['215', '610', '484', '267']

# Generate 10,000 fake pizza shops
pizza_shops = []
for i in range(10000):
    shop = {
        'name': random.choice(names) + f" #{random.randint(1, 500)}",
        'rating': round(random.uniform(2.5, 5.0), 1),
        'review_count': random.randint(10, 1200),
        'price': random.choices(price_levels, weights=[0.5, 0.4, 0.1])[0],
        'distance_m': random.randint(100, 10000),
        'phone': f"+1-{random.choice(area_codes)}-{random.randint(100,999)}-{random.randint(1000,9999)}"
    }
    pizza_shops.append(shop)

# Load into DataFrame
pizza_df = pd.DataFrame(pizza_shops)

print("=== Sample Pizza Places ===")
print(pizza_df.head())

# Filter: rating >= 4.5 and price <= $$
top_picks = pizza_df[(pizza_df['rating'] >= 4.5) & (pizza_df['price'].isin(['$', '$$']))]
print(f"\nTop Picks (rating ≥ 4.5 and $/$$): {len(top_picks)} places")
print(top_picks.sort_values('rating', ascending=False).head(5))

# Closest shop
print("\nClosest pizza place:")
print(pizza_df.sort_values('distance_m').head(1))

# Rating distribution
print("\n=== Rating Distribution ===")
print(pizza_df['rating'].value_counts().sort_index().tail(10))

# Price level breakdown
print("\n=== Price Breakdown ===")
print(pizza_df['price'].value_counts())

# Correlation: reviews vs. rating
print("\n=== Correlation: Review Count vs Rating ===")
print(pizza_df[['review_count', 'rating']].corr())


=== Sample Pizza Places ===
                      name  rating  review_count price  distance_m  \
0        Pizza Palace #354     3.6           276    $$        7481   
1   Neapolitan Express #56     3.4           125     $        8351   
2  Neapolitan Express #270     4.3          1169    $$        4872   
3        Pizza Palace #112     4.1          1123   $$$        8789   
4        Tony’s Pizza #162     3.3          1062   $$$        1306   

             phone  
0  +1-267-493-8346  
1  +1-267-557-9614  
2  +1-610-366-9403  
3  +1-610-583-8798  
4  +1-267-405-6992  

Top Picks (rating ≥ 4.5 and $/$$): 1949 places
                    name  rating  review_count price  distance_m  \
59    Big Al’s Pies #302     5.0          1137     $        2042   
9999   Tony’s Pizza #148     5.0          1056    $$        3361   
10     Mama’s Slice #322     5.0            84    $$        6118   
2763   Big Al’s Pies #57     5.0            95    $$         796   
2907    Pizza Palace #88     5.0     
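
The filter-then-sort pattern in this section can be wrapped in a small search function. The sketch below is stdlib-only and assumes four hand-written shop records; the `price_rank` and `search` helpers are hypothetical names introduced for illustration. It treats a price level as the number of `$` characters so that "at most `$$`" becomes a numeric comparison instead of a list of allowed strings.

```python
# Hypothetical shop records in the same shape as the generated data above
shops = [
    {"name": "Pizza Palace #1", "rating": 4.7, "price": "$$", "distance_m": 900},
    {"name": "Cheesy Bites #2", "rating": 4.9, "price": "$$$", "distance_m": 300},
    {"name": "Mama's Slice #3", "rating": 4.5, "price": "$", "distance_m": 450},
    {"name": "Tony's Pizza #4", "rating": 3.8, "price": "$", "distance_m": 120},
]

def price_rank(price):
    """Map '$', '$$', '$$$' to 1, 2, 3 so price levels can be compared."""
    return len(price)

def search(candidates, min_rating, max_price):
    """Keep shops meeting the rating and price limits, nearest first."""
    matches = [
        shop for shop in candidates
        if shop["rating"] >= min_rating
        and price_rank(shop["price"]) <= price_rank(max_price)
    ]
    return sorted(matches, key=lambda shop: shop["distance_m"])

results = search(shops, min_rating=4.5, max_price="$$")
for shop in results:
    print(f'{shop["name"]}: {shop["rating"]} stars, {shop["price"]}, {shop["distance_m"]} m')
```

Sorting by distance after filtering means the nearest shop overall can be excluded, as happens here with the closest but lower-rated entry.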