# Exercise 1 : Identifying Data Types


Below are various data sources. Identify whether each one is an example of structured or unstructured data.

*  A company’s financial reports stored in an Excel file.
*  Photographs uploaded to a social media platform.
*  A collection of news articles on a website.
*  Inventory data in a relational database.
*  Recorded interviews from a market research study.


--------------------------------------------------------------------------

*  Financial reports in an Excel file are typically organized in rows and columns with clearly defined data types. This is structured data.
*  Photographs are unstructured data because they do not follow a specific data model. However, metadata associated with the photos, such as tags, location, and timestamps, can be structured.
*  News articles are primarily unstructured data because they are composed of free text. However, some information about an article, such as its author, title, and timestamp, may be stored in a structured format.
*   Inventory data in a relational database is structured data. It is organized in tables with rows and columns, with defined data types and relationships between different tables.
*   Recorded interviews are unstructured data because they consist of audio or video recordings, which do not have a predefined data model.
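
The point about photographs can be illustrated with a minimal sketch that keeps structured metadata next to an unstructured payload. The `PhotoRecord` class and its field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class PhotoRecord:
    """Hypothetical record: structured metadata plus an unstructured payload."""
    # Structured part: fixed fields with known types, easy to query or filter
    photo_id: int
    taken_at: datetime
    location: str
    tags: list[str] = field(default_factory=list)
    # Unstructured part: raw image bytes with no predefined data model
    pixels: bytes = b""


photo = PhotoRecord(
    photo_id=1,
    taken_at=datetime(2024, 5, 1, 14, 30),
    location="Lisbon",
    tags=["dog", "beach"],
    pixels=b"\x89PNG...",  # placeholder bytes, not a real image
)

# The metadata can be filtered like a table column; the raw pixels cannot
print("beach" in photo.tags, photo.taken_at.year)
```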

# Exercise 2 : Transformation Exercise

For each of the following unstructured data sources, propose a method to convert it into structured data. Explain your reasoning.
- A series of blog posts about travel experiences.
- Audio recordings of customer service calls.
- Handwritten notes from a brainstorming session.
- A video tutorial on cooking.


---
- We can save blog posts to a database table where each row represents a blog post and the columns include metadata (title, author, date), extracted entities (locations, activities), and a topic classification (category).
- We can transcribe the recordings with speech-to-text software and then create a database table where each row represents a call and the columns include metadata (call ID, date, duration, employee, customer) and extracted details (customer issue, resolution, customer feedback).
- We can use OCR software to digitize the handwritten notes, converting them into machine-readable text, and then save this data to a database table where each row represents a note and the columns include metadata (author, date), the recognized text, and categorized sections or keywords.
- We can extract structured information about the video, such as title, duration, speaker, and topics covered, and then create a database table where each row represents a segment of the tutorial and the columns include metadata (title, duration), a text transcript of the instructions, and a link to the video.
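
As a rough illustration of the blog post case, the sketch below stores one post as a row in an in-memory SQLite table. The sample post, the `blog_posts` table layout, and the tiny location list are all assumptions made for this example; a real pipeline would more likely use a named-entity recognition model than a fixed list.

```python
import sqlite3

# Hypothetical sample post; in practice the fields would come from a scraper or CMS export
post = {
    "title": "Three Days in Kyoto",
    "author": "J. Doe",
    "date": "2024-04-12",
    "text": "We hiked up to Fushimi Inari in Kyoto, then took the train to Nara.",
}

# Tiny illustrative gazetteer standing in for a real entity extractor
KNOWN_LOCATIONS = ["Kyoto", "Nara", "Osaka"]


def extract_locations(text: str) -> list[str]:
    """Return the known locations mentioned in the text, in gazetteer order."""
    return [place for place in KNOWN_LOCATIONS if place in text]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE blog_posts (title TEXT, author TEXT, date TEXT, locations TEXT)"
)
conn.execute(
    "INSERT INTO blog_posts VALUES (?, ?, ?, ?)",
    (post["title"], post["author"], post["date"],
     ", ".join(extract_locations(post["text"]))),
)

# Once the post is a row, it can be queried like any other structured data
rows = conn.execute("SELECT title, locations FROM blog_posts").fetchall()
print(rows)
```

The same pattern would apply to the other sources once a transcript or OCR text is available: extract a few fields, then insert one row per call, note, or video segment.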

# Exercise 3 : Application Scenario

You are a data analyst at a retail company. You have access to various data sources, including transaction records, customer feedback comments, social media posts about your brand, and employee work schedules.

- Categorize each of these data sources as structured or unstructured.
- Suggest how you might use each type of data for improving the company’s business operations.



---
- Transaction records are structured data. They are typically stored in a relational database or spreadsheet, with well-defined fields such as transaction ID, date, time, item purchased, quantity, price, and customer ID. Analyzing transaction data can identify sales trends, peak purchasing times, and popular products. It can also be used to segment customers by purchasing behavior and to optimize operations and resource allocation.
- Customer feedback comments are free-form text and do not follow a predefined format, making them unstructured, although metadata about each comment (customer, timestamp, customer support manager, and so on) can be stored in a structured form. Sentiment analysis on feedback comments can gauge customer satisfaction and identify areas of concern. These insights help enhance the overall customer experience and improve product quality.
- Social media posts are unstructured data, similar to customer feedback comments, and can include a mix of text, images, and videos. Some metadata can be stored in a structured form as well. Analyzing engagement metrics such as likes, shares, and comments measures the effectiveness of social media campaigns and customer engagement strategies.
- Work schedules are structured data, usually maintained in a spreadsheet or a database. We can track employee attendance and punctuality to identify patterns that might affect productivity. This can inform decisions on workforce management and scheduling practices.
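
Two of these analyses can be sketched with the standard library alone: finding the peak purchasing hour and best-selling product from transaction records, and scoring feedback comments. The sample records, the word lists, and the `sentiment_score` helper are hypothetical; the word-count score is a crude stand-in for a real sentiment model.

```python
from collections import Counter
from datetime import datetime

# Hypothetical transaction records (structured): timestamp, product, quantity
transactions = [
    ("2024-06-01 09:15", "coffee", 2),
    ("2024-06-01 12:05", "sandwich", 1),
    ("2024-06-01 12:40", "coffee", 1),
    ("2024-06-01 12:55", "salad", 3),
    ("2024-06-01 17:20", "coffee", 1),
]

# Count transactions per hour to find the peak purchasing time
per_hour = Counter(
    datetime.strptime(ts, "%Y-%m-%d %H:%M").hour for ts, _, _ in transactions
)
peak_hour, _ = per_hour.most_common(1)[0]

# Sum quantities per product to find the most popular item
units = Counter()
for _, product, qty in transactions:
    units[product] += qty
top_product, top_units = units.most_common(1)[0]

# Hypothetical feedback comments (unstructured) scored with a tiny word list
POSITIVE = {"great", "friendly", "fast"}
NEGATIVE = {"slow", "rude", "broken"}


def sentiment_score(comment: str) -> int:
    """Positive word count minus negative word count."""
    words = comment.lower().replace(",", " ").replace(".", " ").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


comments = [
    "Great coffee, friendly staff.",
    "Checkout was slow and the scanner was broken.",
]
scores = [sentiment_score(c) for c in comments]
print(peak_hour, top_product, top_units, scores)
```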

# Exercise 4 : Synthetic Data Generation

- Use the Python Faker library to generate a list of 100 random names, addresses, and email addresses.
- Use numpy to add a column of random ages (between 20 and 60) and a column of random income levels (within a reasonable range).
- Combine this data into a pandas DataFrame and name the columns appropriately.
- Display the first 10 rows of your DataFrame.

In [None]:
!pip install Faker


Collecting Faker
  Downloading Faker-25.5.0-py3-none-any.whl (1.8 MB)
Installing collected packages: Faker
Successfully installed Faker-25.5.0


In [None]:
from faker import Faker
import numpy as np
import pandas as pd

fake = Faker()
n = 100
rng = np.random.default_rng()

# Build each column separately so the numeric columns keep an integer dtype
# (packing everything into one NumPy string array stores the numbers as strings)
df = pd.DataFrame({
    'name': [fake.name() for _ in range(n)],
    'address': [fake.address() for _ in range(n)],
    'email': [fake.email() for _ in range(n)],
    'age': rng.integers(20, 61, size=n),  # upper bound is exclusive
    'income, $': rng.integers(3500, 7001, size=n),
})
df.head(10)  # the exercise asks for the first 10 rows

Unnamed: 0,name,address,email,age,"income, $"
0,Amy Baird,"02982 Joseph Station Suite 919\nWest Erika, NV...",stephaniemorse@example.net,23,5280
1,David Benitez,"643 Cunningham Villages Apt. 556\nFosterview, ...",vstewart@example.org,26,4111
2,Danielle Gray,950 Johnson Glens Suite 801\nNew Tannerborough...,qjimenez@example.org,34,4358
3,Ryan Jones,"693 Dillon Highway Apt. 078\nChadfurt, PW 49179",john18@example.net,40,5525
4,Hannah Peck,"86705 Phillips Haven Suite 726\nNew Janiceton,...",ncameron@example.org,50,4765
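
Since the exercise asks for numpy-generated columns, the following sketch shows one way to draw the ages and incomes with a seeded NumPy generator so that the synthetic data is reproducible. The seed value is arbitrary, and the income bounds simply reuse the 3500 to 7000 range from the cell above.

```python
import numpy as np

# A seeded generator makes the synthetic columns reproducible between runs
rng = np.random.default_rng(seed=42)  # the seed value is arbitrary

# integers() excludes the upper bound by default, so 61 yields ages 20..60
ages = rng.integers(20, 61, size=100)
incomes = rng.integers(3500, 7001, size=100)

# Both arrays have an integer dtype, so numeric operations work directly
print(ages.dtype.kind, ages.min() >= 20, ages.max() <= 60, incomes.mean() > 0)
```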


# Exercise 5 : Data Augmentation For Images

You have a dataset of dog images, and you want to augment this dataset to improve a machine learning model’s performance.
Using the ImageDataGenerator from the TensorFlow Keras library, apply the following transformations to your dataset:

- Rotate the images by various angles (up to 30 degrees).
- Flip the images horizontally and vertically.
- Vary the brightness of the images within a specific range.

Save these augmented images in a new folder.

In [None]:
!pip install tensorflow



In [None]:
!kaggle datasets download -d yaswanthgali/dog-images

Dataset URL: https://www.kaggle.com/datasets/yaswanthgali/dog-images
License(s): CC0-1.0
Downloading dog-images.zip to /content
100% 749M/750M [00:06<00:00, 104MB/s] 
100% 750M/750M [00:06<00:00, 118MB/s]


In [None]:
!unzip dog-images.zip

Streaming output truncated to the last 5000 lines.
  inflating: images/images/n02108089-boxer/n02108089_11687.jpg  
  inflating: images/images/n02108089-boxer/n02108089_117.jpg  
  inflating: images/images/n02108089-boxer/n02108089_11807.jpg  
  inflating: images/images/n02108089-boxer/n02108089_11875.jpg  
  inflating: images/images/n02108089-boxer/n02108089_122.jpg  
  inflating: images/images/n02108089-boxer/n02108089_12232.jpg  
  inflating: images/images/n02108089-boxer/n02108089_125.jpg  
  inflating: images/images/n02108089-boxer/n02108089_12738.jpg  
  inflating: images/images/n02108089-boxer/n02108089_12739.jpg  
  inflating: images/images/n02108089-boxer/n02108089_12827.jpg  
  inflating: images/images/n02108089-boxer/n02108089_13340.jpg  
  inflating: images/images/n02108089-boxer/n02108089_13526.jpg  
  inflating: images/images/n02108089-boxer/n02108089_1353.jpg  
  inflating: images/images/n02108089-boxer/n02108089_1355.jpg  
  inflating: images/images/n02108

In [None]:
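import os
import numpy as np
import tensorflow as tf

# Load one sample image as a NumPy array in (height, width, channels) layout
dog = tf.keras.utils.load_img('images/images/n02111277-Newfoundland/n02111277_13537.jpg')
img = np.asarray(dog)
trans_img = tf.keras.preprocessing.image.ImageDataGenerator()
# Rotate by 30 degrees, the maximum angle the exercise allows
rotation = trans_img.apply_transform(img, transform_parameters={'theta': 30})
# save_img needs a full file path (rotated.png is an illustrative name) and a
# format name without the leading dot, which caused KeyError: '.PNG'
os.makedirs('/content/images/dogs', exist_ok=True)
tf.keras.utils.save_img('/content/images/dogs/rotated.png', rotation, data_format='channels_last', file_format='png')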

In [None]:
trans_img.apply_transform(img, transform_parameters={'flip_horizontal': True})

In [None]:
trans_img.apply_transform(img, transform_parameters={'flip_vertical': True})
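
To make the two flips concrete, here is a small NumPy-only sketch of what they do to an array in channels_last layout. It assumes that a horizontal flip reverses the column axis and a vertical flip reverses the row axis; the tiny 2x3 array stands in for a real image.

```python
import numpy as np

# A tiny 2x3 single-channel "image" in channels_last layout: (rows, cols, channels)
img = np.arange(6).reshape(2, 3, 1)

flipped_h = img[:, ::-1, :]  # horizontal flip: reverse the column axis
flipped_v = img[::-1, :, :]  # vertical flip: reverse the row axis

print(flipped_h[..., 0].tolist())  # [[2, 1, 0], [5, 4, 3]]
print(flipped_v[..., 0].tolist())  # [[3, 4, 5], [0, 1, 2]]
```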

In [None]:
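# 'brightness' is a multiplicative factor (1.0 leaves the image unchanged),
# so a value of 10 pushes most pixels to the 255 ceiling, as the output shows
trans_img.apply_transform(img, transform_parameters={'brightness': 10})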

array([[[ 50., 140., 110.],
        [150., 240., 210.],
        [230., 255., 255.],
        ...,
        [ 90., 140., 100.],
        [ 90., 140., 100.],
        [100., 150., 110.]],

       [[140., 230., 200.],
        [180., 255., 240.],
        [190., 255., 250.],
        ...,
        [160., 210., 170.],
        [110., 160., 120.],
        [ 60., 110.,  70.]],

       [[180., 255., 240.],
        [160., 250., 220.],
        [130., 220., 190.],
        ...,
        [255., 255., 255.],
        [255., 255., 255.],
        [255., 255., 255.]],

       ...,

       [[255., 255., 255.],
        [255., 255., 255.],
        [255., 255., 255.],
        ...,
        [255., 255., 255.],
        [255., 255., 255.],
        [255., 255., 255.]],

       [[255., 255., 255.],
        [255., 255., 255.],
        [255., 255., 255.],
        ...,
        [255., 255., 255.],
        [255., 255., 255.],
        [255., 255., 255.]],

       [[255., 255., 255.],
        [255., 255., 255.],
        [255., 2
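
The exercise asks for various angles and a brightness range rather than single fixed values. One possible approach, sketched below with the standard library only, is to sample a fresh parameter dictionary for each augmented image. The `sample_transform_parameters` helper, the 0.5 to 1.5 brightness range, and the output file name are assumptions made for this example.

```python
import random


def sample_transform_parameters(rng: random.Random) -> dict:
    """Draw one random set of parameters in the ranges the exercise describes."""
    return {
        "theta": rng.uniform(-30, 30),          # rotation angle in degrees
        "flip_horizontal": rng.random() < 0.5,  # flip left-right half the time
        "flip_vertical": rng.random() < 0.5,    # flip top-bottom half the time
        "brightness": rng.uniform(0.5, 1.5),    # assumed range; 1.0 = unchanged
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
params_list = [sample_transform_parameters(rng) for _ in range(5)]

for i, params in enumerate(params_list):
    # With TensorFlow available, each dict could be passed to
    # trans_img.apply_transform(img, transform_parameters=params) and the result
    # saved under a distinct name such as f"dog_aug_{i}.png" (hypothetical name).
    print(i, round(params["theta"], 1), params["flip_horizontal"])
```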

# Exercise 6 : Simulation-Based Dataset Creation

Imagine you are developing a simulation for traffic flow in a city. Your goal is to generate a dataset that reflects different traffic conditions at various times of the day.\
Outline the steps you would take to create this simulation. Consider factors like vehicle types, road types, traffic signals, and peak/off-peak hours.
Describe how you would collect data from this simulation, specifying the types of data (e.g., vehicle count, average speed) you would gather at different time intervals.\
Discuss the potential uses of this simulated dataset in traffic management and urban planning.



---

To create a traffic flow simulation, start by defining the objectives and scope, including different vehicle types (cars, buses, trucks), road types (highways, arterial roads, residential streets), and time periods (peak and off-peak hours). Design a city layout with various zones (commercial, residential, industrial) and road networks, incorporating traffic rules and signals.
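
A very simplified sketch of such a simulation is shown below. It covers vehicle types, road types, and peak versus off-peak hours, but leaves out traffic signals and the zone layout. All numbers (peak windows, base rates, the peak multiplier, and the vehicle mix) are assumed values chosen for illustration, not calibrated figures.

```python
import random
from dataclasses import dataclass

VEHICLE_TYPES = ["car", "bus", "truck"]
ROAD_TYPES = ["highway", "arterial", "residential"]
# Assumed peak windows (start hour inclusive, end hour exclusive)
PEAK_WINDOWS = [(7, 9), (16, 19)]
# Assumed mean vehicles per hour on each road type during off-peak hours
BASE_RATE = {"highway": 600, "arterial": 300, "residential": 60}
PEAK_MULTIPLIER = 2.5  # assumed demand increase during peak hours


@dataclass
class HourlyObservation:
    hour: int
    road_type: str
    is_peak: bool
    vehicle_counts: dict  # vehicle type -> count


def is_peak(hour: int) -> bool:
    return any(start <= hour < end for start, end in PEAK_WINDOWS)


def simulate_day(rng: random.Random) -> list[HourlyObservation]:
    """Generate one observation per hour and road type for a single day."""
    observations = []
    for hour in range(24):
        for road in ROAD_TYPES:
            rate = BASE_RATE[road] * (PEAK_MULTIPLIER if is_peak(hour) else 1.0)
            # Add +/-10% noise, then split the total across vehicle types
            total = int(rate * rng.uniform(0.9, 1.1))
            mix = rng.choices(VEHICLE_TYPES, weights=[0.85, 0.05, 0.10], k=total)
            counts = {v: mix.count(v) for v in VEHICLE_TYPES}
            observations.append(HourlyObservation(hour, road, is_peak(hour), counts))
    return observations


day = simulate_day(random.Random(1))
print(len(day), day[0])
```

Each observation could be written out as one row of the dataset, which keeps the simulated data in a structured form from the start.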

Collect data on vehicle counts, average speeds, travel times, congestion levels, and traffic signal impacts at different intervals. High-frequency data collection during peak hours (e.g., every minute) captures detailed traffic dynamics, while longer intervals during off-peak times (e.g., every 15 minutes) reduce data volume. The simulated dataset helps optimize traffic signal timings and manage congestion in real time.
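
The variable collection interval can be sketched as a simple schedule generator: one sample per minute inside the peak windows and one every 15 minutes otherwise. The peak windows (07:00 to 09:00 and 16:00 to 19:00) and the function names are assumptions made for this example.

```python
from datetime import datetime, timedelta

# Assumed peak windows (start hour inclusive, end hour exclusive)
PEAK_WINDOWS = [(7, 9), (16, 19)]
PEAK_INTERVAL = timedelta(minutes=1)
OFF_PEAK_INTERVAL = timedelta(minutes=15)


def is_peak(moment: datetime) -> bool:
    return any(start <= moment.hour < end for start, end in PEAK_WINDOWS)


def sampling_times(day_start: datetime) -> list[datetime]:
    """Return every collection timestamp for the 24 hours starting at day_start."""
    times = []
    current = day_start
    day_end = day_start + timedelta(days=1)
    while current < day_end:
        times.append(current)
        # Step by one minute in peak hours, by 15 minutes otherwise
        current += PEAK_INTERVAL if is_peak(current) else OFF_PEAK_INTERVAL
    return times


times = sampling_times(datetime(2024, 1, 1))
peak_samples = sum(is_peak(t) for t in times)
print(len(times), peak_samples)  # 376 300
```

With these assumed windows, the five peak hours account for 300 of the 376 daily samples, which shows how the shorter interval concentrates data where traffic changes fastest.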

It aids in infrastructure planning by identifying congestion hotspots and planning new road constructions or modifications. Public transportation planning benefits from analyzing traffic patterns to optimize routes and schedules. Policy-making is informed by evaluating congestion pricing schemes and zoning laws based on traffic data. Environmental impact studies use the data to assess traffic's effect on air quality and promote sustainable transport solutions.

The simulation improves traffic management, urban planning, and decision-making processes, enhancing the overall quality of life in the city by reducing congestion and its associated impacts. By leveraging detailed traffic data, urban planners can make informed decisions that lead to more efficient and sustainable urban environments.