## (2) Source and investigate usable data sources 

Links to read:

[1] 

[2] 

[3]

Sourcing and investigating usable data sources is a crucial step. It involves identifying relevant data that can be used to fine-tune the LLM, so that the agent's responses and recommendations draw on accurate, comprehensive information. By evaluating and selecting the right data sources, developers can make the virtual agent more reliable and more relevant when it answers user queries and provides tailored assistance.

In the world of AI, data is the fuel that powers your models. The quality and quantity of your data directly impact the accuracy and performance of your AI applications.

##### (2A) Identifying Relevant Sources
- Research existing data sources: Explore publicly available datasets, academic repositories, industry-specific databases, and other potential sources. Some websites such as Kaggle and Hugging Face have publicly available datasets as well as valuable machine learning and artificial intelligence learning resources. 

- Consider data availability and accessibility: Evaluate whether the identified data sources are accessible and if there are any restrictions or costs associated with obtaining the data.

- Leverage internal data: Utilize existing company knowledge bases, records, and other internal data sources to supplement external data.

- Consider data ethics and privacy: Ensure that data acquisition and use comply with relevant ethical guidelines and privacy regulations.

- Understand common data storage formats: When searching for and using existing datasets, you'll encounter them in various storage formats. These formats fall into two general categories, structured and unstructured:
    - Structured Data Formats
    
        - SQL Databases: These are widely used for storing structured data with defined fields and relationships. They are ideal for relational data, such as customer information, sales data, and inventory records.
        
        - CSV (Comma-Separated Values): A simple text format where values are separated by commas. CSV files are often used for exporting data from spreadsheets or databases.

        - JSON (JavaScript Object Notation): A human-readable data format that stores data in key-value pairs. JSON is commonly used for representing structured data in APIs and web applications.
    - Unstructured Data Formats

        - Text Files: Plain text files can store various types of data, including documents, code, and log files.

        - PDF (Portable Document Format): A file format used to represent documents, including text, images, and graphics.

        - XML (Extensible Markup Language): A markup language for storing and transporting data. XML is usually considered semi-structured: it is often used for structured data, but it can also carry unstructured content.

        - Images: Formats like JPEG, PNG, and GIF are used for storing images.
        
        - Audio and Video: Formats like MP3, WAV, MP4, and AVI are used for storing audio and video files.
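
As a minimal sketch of how the three structured formats above are typically read in Python, the example below parses the same two coffee records from CSV text, JSON text, and a SQL table using only the standard library. The inline sample text, the `coffee` table name, and the helper function names are assumptions made for illustration, and an in-memory SQLite database stands in for a real SQL server such as MySQL.

```python
import csv
import io
import json
import sqlite3

# Two sample records copied from the coffee dataset preview in this toolkit.
CSV_TEXT = """name,roaster,rating
Flora Blend Espresso,A.R.C.,94
Ethiopia Suke Quto,Roast House,92
"""

JSON_TEXT = (
    '[{"name": "Flora Blend Espresso", "rating": 94},'
    ' {"name": "Ethiopia Suke Quto", "rating": 92}]'
)


def read_csv_records(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json_records(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


def read_sql_records(rows):
    """Load rows into an in-memory SQLite table (hypothetical name) and query them back."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE coffee (name TEXT, rating INTEGER)")
        conn.executemany("INSERT INTO coffee VALUES (?, ?)", rows)
        cursor = conn.execute("SELECT name, rating FROM coffee ORDER BY rating DESC")
        return cursor.fetchall()
    finally:
        conn.close()


csv_records = read_csv_records(CSV_TEXT)
json_records = read_json_records(JSON_TEXT)
sql_records = read_sql_records([(r["name"], int(r["rating"])) for r in csv_records])

print(type(csv_records[0]["rating"]))   # CSV values arrive as strings
print(type(json_records[0]["rating"]))  # JSON keeps numeric types
print(sql_records[0])                   # SQL rows come back as tuples
```

One practical difference this shows: CSV carries no type information, so numeric columns need to be converted after loading, while JSON and SQL preserve the distinction between text and numbers.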

##### (2B) Evaluating Data Quality
Once you've identified relevant data sources, it's crucial to evaluate their quality. This ensures that the data you're using is accurate, reliable, and suitable for training your virtual agent. Here are the key factors to consider:

- Accuracy: Verify the data's accuracy and reliability. This might involve checking sources, comparing with other data, or using data validation techniques.

- Completeness: Ensure the data is comprehensive and doesn't have significant gaps. Identify missing values and consider imputation methods if necessary.

- Consistency: Check for consistency in data formats and definitions. Standardize data if needed to ensure uniformity.

- Timeliness: Consider the age of the data. Outdated data may not be relevant for training your virtual agent.

- Bias: Be aware of potential biases in the data and take steps to mitigate them. Analyze the data distribution and consider the data collection methods to identify potential biases.
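
The factors above can be turned into simple automated checks. The following is a rough sketch using pandas, not a complete validation suite. The sample rows are hypothetical: they reuse column names from the coffee dataset in this toolkit but deliberately include a missing price, a duplicated row, an inconsistent roast label, and an impossible rating. The `quality_report` helper and the assumed 0 to 100 rating range are illustrative choices.

```python
import pandas as pd

# Hypothetical sample with deliberate quality problems (see lead-in).
df = pd.DataFrame({
    "name": ["Flora Blend Espresso", "Ethiopia Suke Quto",
             "Ethiopia Suke Quto", "Ethiopia Gedeb Halo Beriti"],
    "roast": ["Medium-Light", "Medium-Light", "Medium-Light", "medium"],
    "roaster_country": ["Hong Kong", "United States",
                        "United States", "United States"],
    "100g_USD": [9.05, 4.19, 4.19, None],
    "rating": [94, 92, 92, 940],
    "review_date": ["Nov-17", "Nov-17", "Nov-17", "Nov-17"],
})


def quality_report(frame):
    """Collect one simple indicator per quality factor."""
    review_dates = pd.to_datetime(frame["review_date"], format="%b-%y")
    return {
        # Completeness: how many values are missing in each column
        "missing_per_column": frame.isna().sum().to_dict(),
        # Consistency: exact duplicate rows and the spelling variants of a label
        "duplicate_rows": int(frame.duplicated().sum()),
        "roast_labels": sorted(frame["roast"].unique()),
        # Accuracy: ratings outside the assumed valid range of 0 to 100
        "ratings_out_of_range": int((~frame["rating"].between(0, 100)).sum()),
        # Timeliness: date of the most recent review
        "newest_review": review_dates.max(),
        # Bias: share of rows per roaster country
        "country_share": frame["roaster_country"].value_counts(normalize=True).to_dict(),
    }


report = quality_report(df)
for check, result in report.items():
    print(f"{check}: {result}")
```

A report like this only flags candidates for review. Deciding whether a skewed country share is a real bias, or whether a missing price should be imputed or dropped, still depends on the use case.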

Data often comes in a messy, raw form that isn't ready to be used directly in building your AI model. Think of it like trying to cook a meal with ingredients that are still in their packaging and haven't been prepared. You need to open the packages, measure the ingredients, and mix them together before you can start cooking.


### Example:

*NOTE: This learning toolkit uses open-source datasets. Check each dataset's license before reusing it, as some licenses restrict commercial use.*

**Case Scenario**
>
> 

```python
# Prerequisites: Ensure Anaconda and MySQL are running. Use the correct conda environment to run the code cells.

# Import the necessary libraries
import matplotlib.pyplot as plt
import pandas as pd

# Load the CSV file into a DataFrame
# This first dataset is scraped from a coffee review website, www.coffeereview.com
file_path = 'learning-instructions-files/coffee_analysis.csv'
coffee_data = pd.read_csv(file_path)

# Display the first few rows of the DataFrame
coffee_data.head()

# TODO: append text about other parameters (best with which brew type, with milk or sugar, )
```



| Unnamed: 0 | name | roaster | roast | roaster_country | origin_1 | origin_2 | 100g_USD | rating | review_date | desc_1 | desc_2 | desc_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | “Sweety” Espresso Blend | A.R.C. | Medium-Light | Hong Kong | Panama | Ethiopia | 14.32 | 95 | Nov-17 | Evaluated as espresso. Sweet-toned, deeply ric... | An espresso blend comprised of coffees from Pa... | A radiant espresso blend that shines equally i... |
| 1 | Flora Blend Espresso | A.R.C. | Medium-Light | Hong Kong | Africa | Asia Pacific | 9.05 | 94 | Nov-17 | Evaluated as espresso. Sweetly tart, floral-to... | An espresso blend comprised of coffees from Af... | A floral-driven straight shot, amplified with ... |
| 2 | Ethiopia Shakiso Mormora | Revel Coffee | Medium-Light | United States | Guji Zone | Southern Ethiopia | 4.7 | 92 | Nov-17 | Crisply sweet, cocoa-toned. Lemon blossom, roa... | This coffee tied for the third-highest rating ... | A gently spice-toned, floral- driven wet-proce... |
| 3 | Ethiopia Suke Quto | Roast House | Medium-Light | United States | Guji Zone | Oromia Region | 4.19 | 92 | Nov-17 | Delicate, sweetly spice-toned. Pink peppercorn... | This coffee tied for the third-highest rating ... | Lavender-like flowers and hints of zesty pink ... |
| 4 | Ethiopia Gedeb Halo Beriti | Big Creek Coffee Roasters | Medium | United States | Gedeb District | Gedeo Zone | 4.85 | 94 | Nov-17 | Deeply sweet, subtly pungent. Honey, pear, tan... | Southern Ethiopia coffees like this one are pr... | A deeply and generously lush cup saved from se... |



Data usually (if not always) needs to be transformed in one way or another to make it usable for a specific use case. This means changing its shape, format, or content to match the requirements of your model. For example, you might need to convert text data into numerical values, fill in missing data, or combine data from different sources.
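
To make those three kinds of transformation concrete, here is a small sketch in pandas. The `regions` lookup table, the `ROAST_CODES` mapping, and the derived column names are hypothetical and exist only for illustration. Filling missing prices with the median is one common choice among several, not a recommendation for every dataset.

```python
import pandas as pd

# Sample rows based on the coffee dataset preview, with one price removed
# on purpose to demonstrate filling missing data.
coffee = pd.DataFrame({
    "name": ["Flora Blend Espresso", "Ethiopia Suke Quto", "Ethiopia Gedeb Halo Beriti"],
    "roast": ["Medium-Light", "Medium-Light", "Medium"],
    "roaster_country": ["Hong Kong", "United States", "United States"],
    "100g_USD": [9.05, None, 4.85],
})

# Hypothetical second data source to combine with the first.
regions = pd.DataFrame({
    "roaster_country": ["Hong Kong", "United States"],
    "region": ["Asia", "North America"],
})

# Hypothetical ordinal encoding: lighter roasts get smaller codes.
ROAST_CODES = {"Medium-Light": 0, "Medium": 1}

prepared = (
    coffee
    .assign(
        # 1. Convert text to numerical values
        roast_code=coffee["roast"].map(ROAST_CODES),
        # 2. Fill in missing data with the column median
        price_filled=coffee["100g_USD"].fillna(coffee["100g_USD"].median()),
    )
    # 3. Combine data from different sources
    .merge(regions, on="roaster_country", how="left")
)

print(prepared[["name", "roast_code", "price_filled", "region"]])
```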

We'll discuss data transformation as a part of the next Learning Outcome (Transform data for modeling using a data integration tool) but first, our data needs to be investigated through a process called **data exploration**.

##### (2C) Data Exploration

Data exploration is the process of examining data to understand its characteristics, identify patterns, and uncover potential issues before further analysis or modeling. It's like getting to know a new dataset before working with it.

Key techniques in data exploration include:

- Data visualization: Creating visual representations of the data to identify trends, outliers, and relationships. This can be done using charts, graphs, and other visualization tools.

- Summary statistics: Calculating basic statistics like mean, median, mode, and standard deviation to understand data distribution.

- Data profiling: Analyzing data attributes to identify data types, missing values, and data quality issues.

- Correlation analysis: Examining relationships between different variables to identify potential dependencies.
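
The techniques above might look like this in pandas. This is a minimal sketch that uses the five rows shown in the dataset preview earlier, so the numbers are far too few to draw real conclusions from. The variable names are illustrative. A frequency count stands in for a chart here so that the example runs without a plotting backend.

```python
import pandas as pd

# The five rows from the dataset preview, restricted to four columns.
coffee = pd.DataFrame({
    "roast": ["Medium-Light", "Medium-Light", "Medium-Light", "Medium-Light", "Medium"],
    "roaster_country": ["Hong Kong", "Hong Kong", "United States",
                        "United States", "United States"],
    "100g_USD": [14.32, 9.05, 4.70, 4.19, 4.85],
    "rating": [95, 94, 92, 92, 94],
})

# Summary statistics: count, mean, std, min, quartiles, and max per numeric column
summary = coffee[["100g_USD", "rating"]].describe()

# Data profiling: type, missing values, and number of distinct values per column
profile = pd.DataFrame({
    "dtype": coffee.dtypes.astype(str),
    "missing": coffee.isna().sum(),
    "unique": coffee.nunique(),
})

# Correlation analysis: Pearson correlation between price and rating
price_rating_corr = coffee["100g_USD"].corr(coffee["rating"])

# Data visualization (text stand-in): how often each rating occurs
rating_counts = coffee["rating"].value_counts().sort_index()

print(summary)
print(profile)
print(f"price/rating correlation: {price_rating_corr:.2f}")
print(rating_counts)
```

In a notebook with matplotlib installed, `rating_counts.plot(kind="bar")` would draw the same counts as a bar chart.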

Why is data exploration important?

- Understanding the data: Data exploration helps you gain a deeper understanding of the data, including its strengths, weaknesses, and potential limitations.

- Identifying data quality issues: By exploring the data, you can identify and address issues such as missing values, outliers, and inconsistencies.

- Informing data transformation: The insights gained from data exploration can guide your data transformation decisions, ensuring that the data is prepared effectively for modeling.

- Developing hypotheses: Data exploration can help you generate hypotheses about the data that can be tested through further analysis.

In essence, data exploration is a crucial step in the data analysis process that lays the foundation for more advanced techniques and insights.

---
## Practice Learning

Try answering [Practice Learning Activity 2](../ltk_learning-instructions-files/) yourself here before proceeding below.



// Answer to Practice learning 

// Outro

[Next: Case Study 2](../ltk_case-study/case-study-2.ipynb)
