# INFO-7390 --- Venkat Akash Varun Pemmaraju

## Understanding Data 

#### Data refers to raw facts, observations, measurements, or symbols that are typically collected and stored in a structured or unstructured format. 

#### Data can be virtually anything that can be recorded or observed, including text, numbers, images, sounds, and more.



<div style="text-align:center">
    <img src="https://visualstudiomagazine.com/-/media/ECG/visualstudiomagazine/Images/introimages/BigData.jpg" alt="Image Alt Text" width="500"/>
</div>



## Types of Data

<div style="text-align:center">
    <img src="https://intellspot.com/wp-content/uploads/2018/08/Types-of-Data-Infographic.png" alt="Image Alt Text" width="500"/>
</div>


#### Data comes in various forms. Here are some of them:

#### Numerical Data: Quantitative data representing values or counts. For example: age, temperature, or salary.

In [4]:
# Numerical Data
age = [25, 30, 22, 35, 28]
temperature = 23.5
salary = [50000, 60000, 55000, 70000]

#### Discrete: Integer-based counts, such as the number of students in a class or the number of cars in a parking lot.


In [5]:
# Discrete Data
number_of_students = 40
number_of_cars = [50, 30, 20, 10]


#### Continuous: Can take any value within a range. For example, the height of students.

In [6]:
# Continuous Data
height_of_students = 175.2


#### Categorical Data: Qualitative data representing categories or groups. For example, types of cuisine, blood groups, or movie genres.


In [7]:
# Categorical Data
cuisine_types = ["Italian", "Mexican", "Chinese"]
blood_groups = ["A", "B", "AB", "O"]
movie_genres = ["Action", "Comedy", "Drama", "Sci-Fi"]


#### Nominal: No inherent order. For example, different types of fruits.

In [8]:
# Nominal Data
fruit_types = ["Apple", "Banana", "Orange"]


#### Ordinal: Has a logical order. For example, rankings in a competition.

In [13]:
# Ordinal Data
competition_rankings = ["1st", "2nd", "3rd", "4th"]
print(competition_rankings)


['1st', '2nd', '3rd', '4th']
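A plain list of strings does not carry the ranking order by itself. As a minimal sketch, assuming pandas is available, an ordered categorical can make the order explicit. The `"10th"` label and the sample results are hypothetical and are only there to show where alphabetical sorting goes wrong.

```python
import pandas as pd

# Hypothetical ranking labels, listed from best to worst.
rank_order = ["1st", "2nd", "3rd", "10th"]
raw_results = ["10th", "2nd", "1st", "3rd"]

# Plain string sorting is alphabetical, so "10th" lands before "1st".
alphabetical = sorted(raw_results)
print(alphabetical)

# An ordered categorical sorts by the declared rank order instead.
results = pd.Series(pd.Categorical(raw_results, categories=rank_order, ordered=True))
sorted_results = results.sort_values().tolist()
print(sorted_results)
```

Declaring the order once keeps later sorting and comparisons consistent with the meaning of the ranks.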


#### Time-Series Data: Data points indexed in time order, often used in forecasting. For example, stock market prices over time or daily temperatures.


In [11]:
# Time-Series Data
import pandas as pd

stock_prices = pd.Series([100, 110, 95, 105], index=pd.date_range(start="2022-01-01", periods=4, freq="D"))
print(stock_prices)

2022-01-01    100
2022-01-02    110
2022-01-03     95
2022-01-04    105
Freq: D, dtype: int64
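Because the values are indexed by date, operations that depend on time order become straightforward. The sketch below rebuilds the same series and computes a day-over-day change and a two-day rolling average; the variable names `daily_change` and `two_day_avg` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Same illustrative prices as above, indexed by consecutive days.
stock_prices = pd.Series(
    [100, 110, 95, 105],
    index=pd.date_range(start="2022-01-01", periods=4, freq="D"),
)

# Difference between each day and the previous day (first value is NaN).
daily_change = stock_prices.diff()

# Average of each day and the day before it (first value is NaN).
two_day_avg = stock_prices.rolling(window=2).mean()

print(daily_change)
print(two_day_avg)
```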


#### Text Data: Data in text format. Analyzing it often involves Natural Language Processing (NLP). For example, tweets or product reviews.

In [21]:
# Text Data
text_data = "This is a sample text for natural language processing (NLP)."

# Display Text
print(text_data)


This is a sample text for natural language processing (NLP).


#### Multimedia Data: Includes images, audio, and video data, often used in advanced fields like computer vision and speech recognition.


In [25]:
# Multimedia Data
# Note: In practice, handling multimedia data involves more complex code and libraries specific to each data type (e.g., image processing libraries for images).
from IPython.display import Image, display
image_data_url = "https://share-eric.eu/fileadmin/_processed_/b/5/csm_dataheader_ec9ee966be.jpg"
# Display Image
display(Image(url=image_data_url))

#### GPS/Geographic/Spatial Data: Data associated with geographic locations or positions on the Earth's surface. It includes coordinates, maps, satellite images, and geographic information system (GIS) data. For example, the coordinates of locations on a map.


In [20]:
# GPS/Geographic/Spatial Data
# Note: The coordinates are for illustration only; real spatial data usually involves more complex structures.
latitude, longitude = 37.7749, -122.4194


### What is Data Quality? Why is it Important?

Data Quality refers to the accuracy, completeness, reliability, and consistency of data. It is crucial because reliable data forms the foundation for effective decision-making, analysis, and business operations. Poor data quality can lead to errors, misinformation, and misguided decisions. 

#### Here are some examples:

1. If Michael Jordan's basketball earnings are included in a dataset of typical salaries for degree holders, that single extreme value distorts the average. Excluding it gives a more accurate representation of typical salaries among degree holders.

2. When preparing a dataset for sentiment analysis, data cleaning includes identifying and eliminating duplicate user reviews to prevent bias and ensure a more balanced representation of opinions.

3. If a company's customer database contains outdated or incorrect contact information, it may result in failed communication attempts and a loss of potential business opportunities.

<div style="text-align:center">
    <img src="https://4.bp.blogspot.com/-RmYtKB4gaOk/XCTBC_G-A-I/AAAAAAAAAFw/GegTV8LJGFkrs75v86wEuXSjmirWW3T8gCLcBGAs/s1600/Data%2BQuality%2Bin%2Bhindi.png" alt="Image Alt Text" width="500"/>
</div>

#### How Can We Assess the Quality of Our Data?
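One possible starting point is a handful of simple checks for completeness, duplication, and validity. The sketch below assumes pandas and a small hypothetical customer table; the column names and the rule that ages must be non-negative are illustrative assumptions, not a complete quality framework.

```python
import pandas as pd

# Hypothetical customer records with a few typical quality problems.
customers = pd.DataFrame({
    "name": ["Ana", "Ben", "Ben", "Cara"],
    "email": ["ana@example.com", "ben@example.com", "ben@example.com", None],
    "age": [34, 29, 29, -5],
})

# Completeness: number of missing values in each column.
missing_per_column = customers.isna().sum()

# Duplication: number of rows that exactly repeat an earlier row.
duplicate_rows = int(customers.duplicated().sum())

# Validity: a simple rule check, here ages must not be negative.
invalid_ages = int((customers["age"] < 0).sum())

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
print("Invalid ages:", invalid_ages)
```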

#### What is Data Cleaning?
Data Cleaning is the process of identifying and correcting errors, inconsistencies, inaccuracies, and missing values in a dataset to improve its quality and reliability. It is a crucial step in the data preparation phase before using the data for analysis or machine learning.

In practice, data cleaning involves handling missing values, removing duplicates, addressing outliers, standardizing formats, and documenting the changes made. The process is iterative and requires domain knowledge, because deciding what counts as an error depends on what the dataset is intended for.

#### Here are some examples of why data cleaning is important
### Consistency
Converting timestamps to a single standardized format (e.g., YYYY-MM-DD or YYYY-DD-MM) across a dataset for consistency and ease of analysis.

#### Example
Original timestamp: "2023-01-15 08:30:45"

Standardized formats:

YYYY-DD-MM: "2023-15-01"

YYYY-MM-DD: "2023-01-15"
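As a minimal sketch, the conversion above can be done with Python's standard `datetime` module by parsing the original string once and then formatting it as needed. The variable names are illustrative.

```python
from datetime import datetime

raw_timestamp = "2023-01-15 08:30:45"

# Parse the original string into a datetime object.
parsed = datetime.strptime(raw_timestamp, "%Y-%m-%d %H:%M:%S")

# Format the same moment in each target layout.
year_month_day = parsed.strftime("%Y-%m-%d")
year_day_month = parsed.strftime("%Y-%d-%m")

print(year_month_day)
print(year_day_month)
```

Whichever layout is chosen, applying it to every record matters more than the choice itself, since mixed layouts make values such as "2023-05-01" ambiguous.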

### Outliers
Deciding whether to exclude extreme values, for example in a dataset of employee salaries, to prevent distortion in average salary calculations.

#### Example
If Michael Jordan's basketball earnings are included in a dataset of typical salaries for degree holders, that single extreme value distorts the average. Excluding it gives a more accurate representation of typical salaries among degree holders.
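One common way to flag such values is the interquartile range (IQR) rule, sketched below with the standard library. The salary figures are made up for illustration, and the 1.5 multiplier is a conventional default rather than a requirement.

```python
import statistics

# Hypothetical salaries; the last value plays the role of an extreme earner.
salaries = [52000, 58000, 61000, 64000, 70000, 75000, 33000000]

# Quartiles and the interquartile range.
q1, _, q3 = statistics.quantiles(salaries, n=4)
iqr = q3 - q1

# Values outside 1.5 * IQR from the quartiles are flagged as outliers.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [s for s in salaries if s < lower_bound or s > upper_bound]
typical = [s for s in salaries if lower_bound <= s <= upper_bound]

print("Outliers:", outliers)
print("Mean with outliers:", round(statistics.mean(salaries)))
print("Mean without outliers:", round(statistics.mean(typical)))
```

Flagging is only the first step. Whether to drop, cap, or keep an outlier still depends on the question being asked of the data.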

### Removing Duplicates
It includes identifying and eliminating duplicate records or entries to prevent distortion of analysis results and modeling outcomes.

#### Example
When preparing a dataset for sentiment analysis, data cleaning includes identifying and eliminating duplicate user reviews to prevent bias and ensure a more balanced representation of opinions.
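A minimal sketch of this step, assuming pandas and a hypothetical table of reviews with `user` and `review` columns:

```python
import pandas as pd

# Hypothetical reviews; user "u2" submitted the same review twice.
reviews = pd.DataFrame({
    "user": ["u1", "u2", "u2", "u3"],
    "review": ["Great product", "Too expensive", "Too expensive", "Great product"],
})

# Drop rows that repeat an earlier row in every column, keeping the first copy.
deduplicated = reviews.drop_duplicates().reset_index(drop=True)

print(deduplicated)
```

Note that "u1" and "u3" wrote the same text but are different users, so both rows are kept. Treating them as duplicates would require passing `subset=["review"]`, which is a separate modeling decision.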

### Accuracy
It involves verifying the accuracy of data by cross-referencing it with reliable sources or validating it against known standards and rules.

#### Example
Validating sales figures against financial statements to ensure accuracy and reliability in business performance analyses.

### Treating Missing Values
Imputation is a statistical approach to fill in missing data by estimating values based on the available information, ensuring a more complete dataset for analysis or modeling.

#### Example
Let's consider a dataset of ages with missing values:

Original Data: [25, 30, NaN, 22, 28, NaN, 35]

For continuous data (ages), you might choose to impute with the mean:

Imputed Data (mean): [25, 30, 28, 22, 28, 28, 35]

Here, NaN values are replaced with the mean age of the available data: (25 + 30 + 22 + 28 + 35) / 5 = 28.
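The same imputation can be sketched with pandas, whose `mean()` skips missing values by default. The variable names are illustrative.

```python
import pandas as pd

# Ages with two missing entries.
ages = pd.Series([25, 30, float("nan"), 22, 28, float("nan"), 35])

# Mean of the five available values: (25 + 30 + 22 + 28 + 35) / 5.
mean_age = ages.mean()

# Replace each missing entry with that mean.
imputed_ages = ages.fillna(mean_age)

print(mean_age)
print(imputed_ages.tolist())
```

Mean imputation keeps the average unchanged but reduces the spread of the data, so the median or a model-based estimate may be a better fit for skewed columns.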

### EDA
Exploratory Data Analysis (EDA) is crucial in data cleaning as it helps identify anomalies, patterns, and outliers, guiding the cleaning process for more accurate insights.

#### Example
Visualizing a dataset as a histogram may reveal unexpected spikes or gaps, prompting further investigation and cleaning of potentially erroneous data points.
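As a rough sketch, even a text histogram built with the standard library can expose such patterns. The ages below are hypothetical, with `0` standing in for a placeholder that was entered when the real age was unknown.

```python
from collections import Counter

# Hypothetical ages; several records use 0 as a stand-in for "unknown".
ages = [23, 27, 0, 31, 0, 35, 0, 29, 42, 0, 38, 26]
bin_width = 10

# Count how many values fall into each 10-year bin, keyed by the bin's start.
bin_counts = Counter((age // bin_width) * bin_width for age in ages)

# Print every bin up to the maximum, including empty ones, so gaps are visible.
for start in range(0, max(ages) + bin_width, bin_width):
    count = bin_counts[start]
    print(f"{start:>2}-{start + bin_width - 1:<2} {'#' * count}")
```

The spike in the first bin next to an empty 10-19 bin suggests that the zeros are placeholders rather than real ages, which is a cue to treat them as missing values.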

### Normalization
Normalization is used in data cleaning to scale numerical features, ensuring consistent ranges and preventing dominance of certain variables in models.

#### Example
Normalizing income and age values to a common scale (e.g., between 0 and 1) helps avoid biased influence in machine learning algorithms.
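One way to do this is min-max scaling, which maps each value to `(value - min) / (max - min)`. The helper `min_max_scale` and the sample figures below are illustrative assumptions.

```python
def min_max_scale(values):
    """Rescale values to the range 0 to 1 using min-max scaling."""
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        # All values are identical, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(value - lowest) / span for value in values]

# Hypothetical incomes and ages on very different scales.
incomes = [30000, 45000, 60000, 90000]
ages = [20, 35, 50, 80]

scaled_incomes = min_max_scale(incomes)
scaled_ages = min_max_scale(ages)

print(scaled_incomes)
print(scaled_ages)
```

After scaling, both features span the same range, so neither dominates a distance-based model simply because its raw numbers are larger.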

### Encoding
Encoding is applied in data cleaning to convert categorical variables into a numerical format suitable for analysis or modeling.

#### Example
Encoding "Red," "Green," and "Blue" as 1, 2, and 3 allows algorithms to process color categories effectively during data analysis or machine learning tasks.
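The mapping above can be sketched with pandas as shown below. Because integer codes imply an order (Blue > Green > Red) that colors do not actually have, the sketch also shows one-hot encoding as an alternative for nominal data. The sample values and the `color` prefix are illustrative.

```python
import pandas as pd

colors = pd.Series(["Red", "Green", "Blue", "Green"])

# Label encoding: map each category to the integer code from the example.
label_codes = {"Red": 1, "Green": 2, "Blue": 3}
label_encoded = colors.map(label_codes)

# One-hot encoding: one indicator column per category, with no implied order.
one_hot = pd.get_dummies(colors, prefix="color")

print(label_encoded.tolist())
print(one_hot)
```

Label encoding suits ordinal categories, while one-hot encoding is usually the safer default for nominal ones at the cost of extra columns.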
