<a href="https://colab.research.google.com/github/abdulabba0/data_analysis/blob/main/DataAnalytics1_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


##**Data Collection, Integration, and Storage**

**Data Collection**:

Data collection is the process of gathering information from various sources to answer research questions, test hypotheses, and evaluate outcomes. This data is crucial for decision-making in business, research, and policy-making.


##**Techniques for gathering data:**


---



**Surveys:**
- Structured questionnaires used to collect quantitative data.
- Well suited to collecting large amounts of data efficiently.


In [2]:
import pandas as pd

In [4]:
# A sample survey dataset
data = {
    'RespondentID': [1, 2, 3, 4],
    'Age': [25, 30, 22, 35],
    'Satisfaction': [7, 9, 8, 6]
}

df = pd.DataFrame(data)
print(df)


   RespondentID  Age  Satisfaction
0             1   25             7
1             2   30             9
2             3   22             8
3             4   35             6


**Interviews:**

- Can be structured, semi-structured, or unstructured.
- Collect qualitative data providing depth and context.

In [None]:
# Simulated interview: a list of questions paired with sample responses
interview_questions = [
    "How satisfied are you with our service?",
    "What improvements would you suggest?",
    "Can you describe your experience in detail?"
]

# Simulate interview responses
responses = [
    "I am very satisfied.",
    "I suggest more flexible hours.",
    "My experience has been great overall."
]

for question, response in zip(interview_questions, responses):
    print(f"Q: {question}\nA: {response}\n")


Q: How satisfied are you with our service?
A: I am very satisfied.

Q: What improvements would you suggest?
A: I suggest more flexible hours.

Q: Can you describe your experience in detail?
A: My experience has been great overall.



**Web Scraping:**

- Automated process to extract data from websites.
- Useful for collecting large datasets from web pages.

Two libraries are central to web scraping in Python: the requests library and the BeautifulSoup library.

**The requests library** sends HTTP requests to a server and receives the responses. It allows you to:
- Send GET, POST, PUT, DELETE, and other types of requests
- Specify headers, parameters, and data in the request
- Receive the response content, status code, and headers
- Handle cookies and sessions

The requests library is used to fetch web pages, call APIs, and retrieve other online resources.
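
As a minimal sketch of the request features listed above, the snippet below builds a GET request with query parameters and a custom header, then inspects it without contacting any server. The URL, parameters, and header value are hypothetical and chosen only for illustration.

```python
import requests

# Hypothetical endpoint, parameters, and header, used only for illustration.
request = requests.Request(
    method="GET",
    url="https://example.com/search",
    params={"q": "skateboard", "page": 2},
    headers={"User-Agent": "data-analysis-notebook"},
)

# prepare() builds the exact request that would be sent,
# without making a network call.
prepared = request.prepare()
print(prepared.method)
print(prepared.url)
print(prepared.headers["User-Agent"])

# To actually send it and read the response:
# with requests.Session() as session:
#     response = session.send(prepared, timeout=10)
#     print(response.status_code, response.headers.get("Content-Type"))
```

Preparing the request first makes it easy to check the final URL and headers before any traffic is sent.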


**BeautifulSoup:**
BeautifulSoup is a library used for parsing and scraping HTML and XML documents. It allows you to:
- Parse HTML and XML documents into a tree-like structure
- Search and navigate the document using various methods (e.g., find, find_all, select)
- Extract data from the document, such as text, attributes, and tags
- Modify the document and create new content

BeautifulSoup is used to extract data from web pages and to clean and transform that data before analysis.
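
The search methods named above can be tried offline on a small HTML string. The markup below is a hypothetical fragment written for this illustration, not the content of any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, used only for illustration.
html = """
<div id="cart">Cart (2)</div>
<div class="detail-box"><h5><span>New Skateboard</span></h5><h6> $75 </h6></div>
<div class="detail-box"><h5><span>Used Skateboard</span></h5><h6> $80 </h6></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns the first match; find_all returns every match.
first_box = soup.find("div", class_="detail-box")
all_boxes = soup.find_all("div", class_="detail-box")

# Elements can also be located by id.
cart = soup.find(id="cart")

# select takes a CSS selector and returns a list of matching elements.
names = [span.get_text() for span in soup.select("div.detail-box h5 span")]
prices = [h6.get_text(strip=True) for h6 in soup.select("div.detail-box h6")]

print(len(all_boxes), cart.get_text())
print(names)
print(prices)
```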

***Use Cases:***
Some common use cases for requests and BeautifulSoup include:

- Web scraping: Extracting data from websites for analysis, monitoring, or automation.
- Data mining: Extracting large amounts of data from websites for research or business purposes.
- Automation: Automating tasks on websites, such as submitting forms. Clicking buttons or running JavaScript generally requires a browser automation tool instead.
- Monitoring: Monitoring website changes, prices, or availability.
- Research: Extracting data for research purposes, such as sentiment analysis or trend analysis.

**Example Demonstration of web scraping - Example 1**

Fetching all product names and prices from an e-commerce store.

In [None]:
import requests
from bs4 import BeautifulSoup

url = 'https://scrapepark.org/'  # website to scrape
response = requests.get(url)  # fetch the content of the page
# Parse the HTML so that information can be extracted from it
soup = BeautifulSoup(response.content, 'html.parser')

# Other ways to locate elements:
# data = soup.find_all('div')                   # by tag: every <div> element
# data = soup.find_all(attrs={"class": "box"})  # by attribute: every element with class "box"
# data = soup.find_all(class_="detail-box")     # by class: every element with class "detail-box"
# data = soup.find_all(id="cart")               # by id: the element with id "cart"
# data = soup.select("div.detail-box h5 span")  # by CSS selector

# Example usage: find all product names and prices
product_names = soup.select("div.detail-box h5 span")
product_prices = soup.select("div.detail-box h6")

# Loop through the matched elements and extract their text
for pname, pprice in zip(product_names, product_prices):
    print([pname.text, pprice.text.strip()])


['New Skateboard', '$75']
['Used Skateboard', '$80']
['New Skateboard', '$68']
['Used Skateboard', '$70']
['New Skateboard', '$75']
['New Skateboard', '$58']
['New Skateboard', '$80']
['New Skateboard', '$35']
['New Skateboard', '$165']
['Used Skateboard', '$54']
['Used Skateboard', '$99']
['New Skateboard', '$110']


##**What is Representative Sampling**
**Representative Sampling** is a technique for selecting a sample whose key characteristics mirror those of the larger population. This means that the findings from the sample can be generalized to the entire population with a higher degree of confidence.

##**Importance of Representative Sampling**

> **Generalizability**: Ensures that the study’s findings can be applied to the entire population.

> **Accuracy**: Reduces biases and errors, leading to more reliable results.

> **Efficiency**: Helps in making valid conclusions without having to survey the entire population.

##**Techniques to Achieve Representative Sampling**

> **Simple Random Sampling**: Every member of the population has an equal chance of being selected.

> **Stratified Sampling**: The population is divided into subgroups (strata) based on characteristics like age, gender, income, etc., and samples are drawn from each stratum.

> **Systematic Sampling**: Every nth member of the population is selected after a random starting point.

> **Cluster Sampling**: The population is divided into clusters, and entire clusters are randomly selected.

> **Multi-Stage Sampling**: A combination of two or more sampling methods, usually involving multiple stages of selection.
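
The first four techniques above can be sketched with pandas. The population below is hypothetical (100 people, 60 in one region and 40 in another, grouped into blocks of ten), and the sample sizes are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical population: 100 people, split 60/40 across two regions.
population = pd.DataFrame({
    "person_id": range(1, 101),
    "region": ["North"] * 60 + ["South"] * 40,
})
# Hypothetical clusters: ten blocks of ten people each.
population["block"] = (population["person_id"] - 1) // 10

# Simple random sampling: every row has the same chance of selection.
simple_random = population.sample(n=10, random_state=42)

# Stratified sampling: draw 10% from each region, preserving the 60/40 split.
stratified = population.groupby("region").sample(frac=0.1, random_state=42)

# Systematic sampling: every 10th row after a starting offset
# (in practice the offset would be chosen at random from 0 to 9).
start = 3
systematic = population.iloc[start::10]

# Cluster sampling: pick two whole blocks at random and keep everyone in them.
chosen_blocks = pd.Series(population["block"].unique()).sample(n=2, random_state=42)
cluster = population[population["block"].isin(chosen_blocks)]

print(stratified["region"].value_counts())
print(len(simple_random), len(systematic), len(cluster))
```

Multi-stage sampling would chain these steps, for example selecting blocks first and then drawing a simple random sample within each selected block.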


##**Data Gathering Process and Various Data Sources**

###**Steps to the Data Gathering Process**

The data gathering process involves a systematic approach to collecting data. Here are the steps:

1. Define the Objective
- Clearly articulate the research question or objective
- Identify the purpose of data collection

2. Identify Data Sources
- Determine the type of data needed (qualitative, quantitative, or both)
- Select appropriate primary or secondary data sources

3. Develop a Data Collection Plan
- Choose appropriate data collection methods (e.g., surveys, interviews, observations)
- Create data collection instruments (e.g., questionnaires, interview guides)

4. Collect Data
- Execute the data collection plan
- Gather data from selected sources

5. Store and Manage Data
- Store data in a secure and accessible location
- Organize and clean the data

6. Data Quality Check
- Verify data accuracy and completeness
- Ensure data consistency and reliability

7. Data Analysis
- Apply appropriate data analysis techniques
- Extract insights and meaningful patterns

8. Interpret and Report Results
- Draw conclusions based on data analysis
- Present findings in a clear and concise manner

By following these steps, you can ensure a systematic and effective data gathering process that yields reliable and useful data for analysis.
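
The data quality check (step 6) can be sketched with a few pandas calls. The survey data below is hypothetical and deliberately contains a missing value, a duplicated row, and an out-of-range score; the 1 to 10 satisfaction scale is an assumption made for this example.

```python
import pandas as pd

# Hypothetical survey data with deliberate quality problems.
survey = pd.DataFrame({
    "RespondentID": [1, 2, 2, 4],
    "Age": [25, 30, 30, None],
    "Satisfaction": [7, 9, 9, 11],  # scale assumed to run from 1 to 10
})

# Completeness: count missing values in each column.
missing_per_column = survey.isna().sum()

# Consistency: count rows that exactly repeat an earlier row.
duplicate_rows = survey.duplicated().sum()

# Accuracy: flag scores that fall outside the assumed valid range.
out_of_range = survey[~survey["Satisfaction"].between(1, 10)]

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
print(out_of_range)
```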

###**Data Sources:**

**Primary Sources:**

Primary data sources involve collecting original data directly from the source, tailored to specific research or analytics needs. Some common primary data sources include:

1. Surveys and Questionnaires
- Online or offline questionnaires
- Phone or in-person interviews
- Paper or mobile surveys

2. Experiments and Tests
- Controlled laboratory experiments
- Field experiments (e.g., A/B testing)
- User testing and usability studies

3. Observational Studies
- Participant observation
- Case studies
- Ethnographic research

4. Sensors and IoT Devices
- Temperature, motion, or location sensors
- Log data from applications or systems
- IoT devices (e.g., smart home devices)

5. Transactional Data
- Sales data
- Customer purchase history
- Transactional logs (e.g., website interactions)

6. Focus Groups and Interviews
- In-depth interviews
- Focus groups and discussions
- Expert interviews

**Secondary Sources:**

Secondary data sources are pre-existing data collections gathered by others, often for different purposes. These sources can be valuable for data analytics, providing insight and context. Some common secondary data sources include:

1. Public Sources
- Government statistics and reports (e.g., census data, economic indicators)
- International organizations (e.g., World Bank, WHO, IMF)
- Public datasets (e.g., UCI Machine Learning Repository, Kaggle Datasets)

2. Academic Sources
- Research papers and journals
- Theses and dissertations
- Academic books and textbooks

3. Commercial Sources
- Market research reports (e.g., Euromonitor, Nielsen)
- Industry associations and trade organizations
- Company reports and financial statements

4. Online Sources
- Social media platforms (e.g., Twitter, Facebook)
- Online surveys and polls
- Web scraping (extracting data from websites)

5. Internal Sources
- Company databases and archives
- Customer relationship management (CRM) systems
- Enterprise resource planning (ERP) systems

Remember to evaluate the credibility and relevance of secondary data sources for your specific analytics needs.

In [None]:
survey_questions = [
    "How old are you?",
    "How satisfied are you with our service?",
    "What improvements would you suggest?"
]

print("Survey Questions:")
for question in survey_questions:
    print(f"- {question}")


Survey Questions:
- How old are you?
- How satisfied are you with our service?
- What improvements would you suggest?


##**Aggregate Data from Multiple Sources and Integrate Them into Datasets**

**Techniques:**

- Using SQL queries to combine tables from different databases.
- Fetching data programmatically from web APIs.
- Merging CSV or Excel files.

**Challenges:**

- Data Format Disparities: Different formats requiring conversion.
- Alignment Issues: Ensuring data consistency and accuracy.
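
A minimal sketch of integrating two sources, assuming one arrives as CSV text and the other lives in a SQL table. All table names, column names, and values here are hypothetical; the in-memory SQLite database and the `StringIO` buffer stand in for a real database and a real file.

```python
import io
import sqlite3
import pandas as pd

# Source 1: CSV text standing in for a file on disk (hypothetical data).
csv_text = "customer_id,name\n1,Alice\n2,Bob\n3,Charlie\n"
customers = pd.read_csv(io.StringIO(csv_text))

# Source 2: an in-memory SQLite table standing in for a real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(101, "1", 25.0), (102, "1", 40.0), (103, "3", 15.5)],
)
orders = pd.read_sql_query("SELECT * FROM orders", conn)
conn.close()

# Format disparity: customer_id is text in the database but an integer in the CSV,
# so convert it before joining.
orders["customer_id"] = orders["customer_id"].astype(int)

# Alignment: a left join keeps every customer, including those with no orders.
merged = customers.merge(orders, on="customer_id", how="left")
print(merged)
```

The left join makes the alignment problem visible: customers without matching orders appear with missing values rather than silently dropping out.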


## **Various Data Storage Solutions**

1. Data Warehouses:
- Structured storage for easy analysis.
- Example: Amazon Redshift, Google BigQuery.

2. Data Lakes:
- Store raw data for future use.
- Example: Hadoop, Amazon S3.

3. File-Based Storage:
- CSV, Excel for smaller datasets.

4. Cloud Storage Solutions:
- Scalable and accessible storage options.
- Example: AWS S3, Google Cloud Storage.

5. Network-Based Storage:
- Lets multiple computers access storage over a network, which makes data sharing and collaboration easier.
- Example: a file server on which several connected computers store and access their data.

6. Backup Storage:
- Protects against data loss from disaster, failure, or fraud by periodically copying data and applications to a separate, secondary device.

7. Object Storage:
- Stores large amounts of unstructured data, such as emails, videos, photos, web pages, audio files, and sensor data.
- Example: Amazon S3 (Simple Storage Service), Microsoft Azure Blob Storage.
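
For the file-based option, a small round trip shows the idea: write a DataFrame to a CSV file and read it back. The file name `people.csv` and the data are hypothetical, and a temporary directory is used so the sketch leaves nothing behind.

```python
import os
import tempfile
import pandas as pd

# Hypothetical small dataset.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["Alice", "Bob", "Charlie"],
})

# Write to a CSV file in a temporary directory, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "people.csv")
    df.to_csv(path, index=False)  # index=False avoids writing an extra index column
    restored = pd.read_csv(path)

print(restored)
```

The same pattern applies to Excel files via `to_excel` and `read_excel`, which need an additional engine package such as openpyxl.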

##**Categories of Data Used for Analysis and Machine Learning**
1. Structured Data:
- Organized and formatted data
- Easily searchable and machine-readable
- Typically stored in databases or spreadsheets

**Examples:**
- Customer information (name, address, phone number)
- Sales data (product, price, date)
- Sensor readings (temperature, humidity, timestamp)

2. Unstructured Data:
- Unorganized and unformatted data
- Difficult to search and analyze using traditional methods
- Typically stored in files or documents

**Examples:**
- Social media posts
- Emails
- Images and videos
- Audio files
- Text documents (reports, articles, books)

**Key differences:**
- Format: Structured data has a predefined format, while unstructured data lacks a standardized format.
- Searchability: Structured data is easily searchable, while unstructured data requires natural language processing or other specialized techniques.
- Storage: Structured data is typically stored in databases, while unstructured data is stored in files or documents.

``Understanding the difference between structured and unstructured data is crucial for effective data management, analysis, and decision-making.``



1. Structured Data:
- Organized data (e.g., databases, spreadsheets).
- Easier to analyze using SQL or data frames.

In [None]:
import pandas as pd

structured_data = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

print(structured_data)


2. Unstructured Data:

- Free-form data (e.g., text, images).
- Requires preprocessing for analysis.

In [None]:
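import librosa
import pandas as pd

# Load the audio file as a waveform (NumPy array) together with its sampling rate.
# "audio_clip.wav" is a placeholder; the file must exist for this cell to run.
audio, sr = librosa.load("audio_clip.wav")

# Store the whole waveform as a single cell in a DataFrame
data = {"Audio": [audio]}
df = pd.DataFrame(data)
print(df)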

  audio, sr = librosa.load("audio_clip.wav")
	Deprecated as of librosa version 0.10.0.
	It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)


FileNotFoundError: [Errno 2] No such file or directory: 'audio_clip.wav'
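
Unstructured text can illustrate the same preprocessing step without any external file. The sketch below turns free-form text into a structured word-frequency table; the review sentences and the simple regex tokenizer are assumptions made for this example, not a recommended NLP pipeline.

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical free-form reviews (unstructured text).
reviews = [
    "Great service, very friendly staff!",
    "Service was slow but the staff were friendly.",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

# Count how often each word appears across all reviews.
counts = Counter(token for review in reviews for token in tokenize(review))

# The result is structured: one row per word, with a count column.
word_freq = pd.DataFrame(counts.most_common(), columns=["word", "count"])
print(word_freq.head())
```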

##**Types of Data for Data Analysis and Machine Learning**

Data comes in various forms, and understanding the types of data is crucial for effective data analysis and machine learning. Here are the main types of data:

1. Numerical Data (Quantitative)
- Continuous (e.g., height, weight, temperature)
- Discrete (e.g., age in whole years, number of children)

2. Categorical Data (Qualitative)
- Nominal (e.g., gender, color, religion)
- Ordinal (e.g., education level, socioeconomic status)

3. Text Data
- Unstructured (e.g., social media posts, reviews)
- Semi-structured (e.g., XML, JSON)

4. Image and Video Data
- Visual data (e.g., images, videos)

5. Time Series Data
- Sequential data (e.g., stock prices, sensor readings)

6. Graph Data
- Network data (e.g., social connections, web graphs)

7. Audio Data
- Speech, music, or other audio files

8. Sensor Data
- IoT data (e.g., temperature, motion, location)

9. Transactional Data
- Sales data, customer purchases, and transactions

These data types can be used in various machine learning tasks, such as:
- Regression (numerical data)
- Classification (categorical data)
- Clustering (numerical and categorical data)
- Natural Language Processing (text data)
- Computer Vision (image and video data)
- Time Series Forecasting (time series data)

Understanding the data type is essential for selecting appropriate algorithms and techniques in data analysis and machine learning.
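
One way to make the numerical and categorical types concrete is to look at how pandas can represent them. The column names and values below are hypothetical, and the ordering of education levels is an assumption made for this example.

```python
import pandas as pd

# Hypothetical records covering the numerical and categorical types.
df = pd.DataFrame({
    "height_cm": [175.5, 162.0, 180.2],                    # numerical, continuous
    "num_children": [0, 2, 1],                             # numerical, discrete
    "color": ["red", "blue", "red"],                       # categorical, nominal
    "education": ["college", "high school", "graduate"],   # categorical, ordinal
})

# Nominal: categories with no inherent order.
df["color"] = df["color"].astype("category")

# Ordinal: categories with an explicit order (assumed here).
education_levels = ["high school", "college", "graduate"]
df["education"] = pd.Categorical(
    df["education"], categories=education_levels, ordered=True
)

print(df.dtypes)

# Ordered categories support comparisons and min/max; nominal ones do not.
lowest_level = df["education"].min()
above_high_school = df["education"] > "high school"
print(lowest_level)
print(above_high_school.tolist())
```

Declaring the order explicitly matters because plain string comparison would sort the levels alphabetically rather than by rank.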

##**Discrete and Continuous Data**
1. **Continuous data** is a type of numerical data that can take on any value within a certain range or interval. It is typically measured or calculated to a high degree of precision, and can be further subdivided into smaller increments.

Examples of continuous data include:
- Measurements:
  - Height (e.g., 175.5 cm)
  - Weight (e.g., 73.2 kg)
  - Temperature (e.g., 23.4°C)
- Time:
  - Duration (e.g., 2.5 hours)
  - Time of day (e.g., 14:37)
- Financial:
  - Stock prices (e.g., $123.45)
  - Exchange rates (e.g., 1.2345 EUR/USD)
- Sensor readings:
  - Blood pressure (e.g., 120/80 mmHg)
  - GPS coordinates (e.g., 37.7749° N, 122.4194° W)

```Continuous data is typically stored as floating-point numbers (e.g., float or double) and can be analyzed using various statistical and machine learning techniques, such as regression, clustering, and density estimation.```

```In pandas, continuous data is usually stored in floating-point columns (e.g., float64) and can be manipulated using various methods, such as filtering, grouping, and sorting.```

2. **Discrete data** is a type of data that can only take on specific, distinct values. It is typically counted or enumerated, and each value is separate and distinct from others. Discrete data can be thought of as "chunky" or "granular", with clear gaps between each value.

Examples of discrete data include:
- Categorical data:
  - Gender (male/female)
  - Color (red/green/blue)
  - Religion (Christian/Muslim/Hindu)
- Ordinal data:
  - Education level (high school/college/graduate degree)
  - Socioeconomic status (low/middle/high)
  - Movie ratings (1/2/3/4/5 stars)
- Count data:
  - Number of children
  - Number of cars owned
  - Number of votes cast
- Nominal data:
  - Names (John/Mary/David)
  - IDs (123/456/789)
  - Codes (A/B/C/D)

```Discrete data is often stored as integer or categorical columns in pandas and can be analyzed using various statistical and machine learning techniques, such as classification, clustering, and frequency analysis. In contrast to continuous data, discrete data has distinct gaps between each value, making it suitable for counting, categorizing, and enumerating.```

###**Difference between Discrete and Continuous Data**

**Discrete Data**
- Nature: Countable; takes specific, distinct values.
- Values: Only specific values, usually whole numbers, with gaps between them.
- Examples: Number of students in a class, number of cars in a parking lot, number of books on a shelf.
- Measurement: Finite or countably infinite.
- Representation: Bar graphs, pie charts.
- Typical Use: Counting distinct items or occurrences.

**Continuous Data**
- Nature: Measurable; can take any value within a given range.
- Values: Any value within a range, including fractions and decimals, with no gaps between them.
- Examples: Height of a person, weight of an object, temperature of a room.
- Measurement: Infinitely divisible.
- Representation: Histograms, line graphs.
- Typical Use: Measuring characteristics that can vary smoothly over a range.


**Summary of Key Differences:**
- Countable vs. Measurable: Discrete data is countable, whereas continuous data is measurable.
- Gaps Between Values: Discrete data has gaps between values, while continuous data does not.
- Type of Values: Discrete data typically involves whole numbers; continuous data can include fractions and decimals.
- Graphical Representation: Discrete data is often shown with bar graphs or pie charts; continuous data is often shown with histograms or line graphs.
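
The difference shows up directly in how each kind of data is summarized: discrete values can be counted as they are, while continuous values are first grouped into intervals (the idea behind a histogram). The values and bin edges below are hypothetical.

```python
import pandas as pd

# Hypothetical samples.
children = pd.Series([0, 2, 1, 2, 3, 0, 2])                             # discrete
heights = pd.Series([158.2, 171.5, 165.0, 180.3, 175.8, 169.4, 162.7])  # continuous

# Discrete: count how often each distinct value occurs (bar-chart style summary).
child_counts = children.value_counts().sort_index()

# Continuous: group values into intervals, then count per interval
# (histogram style summary). The bin edges are an arbitrary choice.
bins = [150, 160, 170, 180, 190]
height_bins = pd.cut(heights, bins=bins).value_counts().sort_index()

print(child_counts)
print(height_bins)
```

Counting distinct heights directly would give one entry per measurement, which is why continuous data needs binning before a frequency summary is useful.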