# Class 11: Working with Real-World Data

Welcome to the eleventh class of our Python course! In this class, we will focus on working with real-world data. You'll learn how to import data from various formats, clean the data, and perform exploratory data analysis (EDA) to gain insights. Let's dive in!

## 1. Importing Data

One of the first steps in data analysis is importing data into your environment. Pandas makes it easy to import data from various formats such as CSV, Excel, and more.

### 1.1. Reading Data from CSV

CSV (Comma-Separated Values) is a common format for storing tabular data. Here's how you can read a CSV file in Google Colab:

In [None]:
import pandas as pd

# Reading a CSV file
url = 'https://example.com/data.csv'  # Replace with your actual URL or file path
df = pd.read_csv(url)
print(df.head())
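
To try `read_csv` without a network connection, here is a minimal sketch that parses CSV text held in memory. The data, the `;` delimiter, and the variable names `csv_text` and `df_csv` are made up for illustration; they are not part of the course dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content, inlined so the example runs without a download
csv_text = """name;age;city
Alice;25;New York
Bob;30;Los Angeles
Charlie;;Chicago
"""

# sep sets the delimiter; the empty age field is parsed as NaN
df_csv = pd.read_csv(io.StringIO(csv_text), sep=';')

print(df_csv)
print(df_csv.dtypes)  # 'age' is float64 because it contains a missing value
```

The same `sep` argument works when you pass a URL or file path instead of a `StringIO` object.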

### 1.2. Reading Data from Excel

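Excel is another widely used format. Pandas provides `read_excel` for reading Excel files; for `.xlsx` files it relies on the `openpyxl` package, so install it with `pip install openpyxl` if it is missing from your environment.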

In [None]:
# Reading an Excel file
url = 'https://example.com/data.xlsx'  # Replace with your actual URL or file path
df = pd.read_excel(url, sheet_name='Sheet1')
print(df.head())

### 1.3. Reading Data from Other Formats

Pandas can also read data from other formats, including JSON (`read_json`) and HTML tables (`read_html`).

In [None]:
# Reading a JSON file
url = 'https://example.com/data.json'  # Replace with your actual URL or file path
df = pd.read_json(url)
print(df.head())
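
As a self-contained illustration of `read_json`, the sketch below parses a small JSON string instead of a URL. The payload and the names `json_text` and `df_json` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical JSON payload: a list of records, one object per row
json_text = '[{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]'

# Wrapping the string in StringIO avoids passing a literal JSON string,
# which recent pandas versions deprecate
df_json = pd.read_json(io.StringIO(json_text))
print(df_json)
```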

## 2. Data Cleaning

After importing the data, the next step is to clean it. Real-world data is often messy, containing duplicates, missing values, and outliers.
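
Missing values are mentioned here but do not get their own subsection, so here is a minimal sketch of the usual first steps: counting gaps, dropping incomplete rows, and filling gaps with defaults. The data, the variable names, and the choice of fill values (median age, the label `'Unknown'`) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps
df_missing = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 35, 40],
    'City': ['New York', 'Los Angeles', None, 'Chicago']
})

# Count missing values per column
print(df_missing.isna().sum())

# Option 1: drop every row that has at least one missing value
df_dropped = df_missing.dropna()

# Option 2: fill gaps with a column-specific default
df_filled = df_missing.fillna({
    'Age': df_missing['Age'].median(),
    'City': 'Unknown'
})
print(df_filled)
```

Dropping rows is simplest but discards data; filling keeps every row at the cost of introducing values that were never observed.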

### 2.1. Handling Duplicates

Duplicate rows can skew your analysis, for example by inflating counts and shifting averages. Pandas makes it easy to identify and remove them.

In [None]:
# Creating a DataFrame with duplicates
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'Age': [25, 30, 35, 25],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York']
}
df = pd.DataFrame(data)

# Identifying duplicates: True marks a row that repeats an earlier row
duplicates = df.duplicated()
print("Duplicates:")
print(duplicates)

# Removing duplicates (the first occurrence is kept by default)
df_cleaned = df.drop_duplicates()
print(df_cleaned)
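
Real duplicates are often only partial matches. The sketch below shows the `subset` and `keep` parameters of `drop_duplicates`; the data is a hypothetical variation in which Alice's second record lists a different city.

```python
import pandas as pd

# Hypothetical data: Alice appears twice, with two different cities
df_people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'Age': [25, 30, 35, 25],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Boston']
})

# A full-row comparison finds no duplicates, because the City values differ
print(df_people.duplicated().sum())

# Compare only Name and Age, and keep the last occurrence of each match
df_dedup = df_people.drop_duplicates(subset=['Name', 'Age'], keep='last')
print(df_dedup)
```

Which occurrence to keep depends on the data; `keep='last'` is a reasonable choice when later rows are more recent.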

### 2.2. Handling Outliers

Outliers are values that are significantly higher or lower than the rest of the data. They can affect the results of your analysis.

**Identifying Outliers:**

In [None]:
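import numpy as np

# Creating a DataFrame with an outlier.
# In a sample of n values, no z-score can exceed (n - 1) / sqrt(n), so with
# only 4 rows (limit 1.5) a threshold of 2 would never flag anything.
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Heidi'],
    'Age': [25, 30, 35, 100, 28, 32, 27, 31]  # 100 is an outlier
}
df = pd.DataFrame(data)

# Identifying outliers using z-score (distance from the mean in standard deviations)
df['zscore'] = (df['Age'] - df['Age'].mean()) / df['Age'].std()
outliers = df[df['zscore'].abs() > 2]  # Treating |z-score| > 2 as an outlier
print(outliers)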
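
In small samples the z-score can be unreliable, because the outlier itself inflates both the mean and the standard deviation. A common alternative, shown here as a sketch and not as the method used in this class, is the interquartile range (IQR) rule. The 1.5 multiplier is the conventional choice and the variable names are illustrative.

```python
import pandas as pd

# Four ages, one of which is far from the rest
ages = pd.Series([25, 30, 35, 100], name='Age')

# Quartiles and interquartile range
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
iqr_outliers = ages[(ages < lower) | (ages > upper)]

print(f"Bounds: [{lower}, {upper}]")
print(iqr_outliers)
```

Because quartiles depend on the middle of the data, a single extreme value shifts these bounds far less than it shifts the mean and standard deviation.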

**Handling Outliers:**

You can either remove outliers or replace them with a specific value.

In [None]:
# Removing outliers
df_no_outliers = df[df['zscore'].abs() <= 2]
print(df_no_outliers)

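# Replacing outliers with the median, which is barely affected by extreme values
median_age = df['Age'].median()
# np.where keeps the original age unless |z-score| > 2; the column becomes
# float64 because the median is a float
df['Age'] = np.where(df['zscore'].abs() > 2, median_age, df['Age'])
print(df)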

## 3. Exploratory Data Analysis (EDA)

EDA is the process of analyzing data sets to summarize their main characteristics, often using visual methods.

### 3.1. Descriptive Statistics

Descriptive statistics summarize the central tendency, dispersion, and shape of a dataset’s distribution.

In [None]:
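# Descriptive statistics
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [70000, 80000, 90000, 100000]
})
# describe() reports count, mean, std, min, quartiles, and max for numeric
# columns; the text column 'Name' is left out by default
print(df.describe())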
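
To connect the three ideas named above (central tendency, dispersion, shape) to individual pandas methods, the following sketch computes one statistic for each on the salary values. The `summary` dictionary is only an illustrative way to collect the results.

```python
import pandas as pd

salaries = pd.Series([70000, 80000, 90000, 100000], name='Salary')

summary = {
    'mean': salaries.mean(),      # central tendency
    'median': salaries.median(),  # central tendency, robust to outliers
    'std': salaries.std(),        # dispersion (sample standard deviation)
    'skew': salaries.skew(),      # shape: close to 0 for symmetric data
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The mean and median coincide here because the values are evenly spaced; a gap between them is often the first hint of skew or outliers.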

### 3.2. Visual Exploratory Analysis

Visualizing data helps in identifying patterns, trends, and outliers.

In [None]:
import matplotlib.pyplot as plt

# Creating a scatter plot
plt.scatter(df['Age'], df['Salary'])
plt.title('Age vs Salary')
plt.xlabel('Age')
plt.ylabel('Salary')
plt.show()

# Creating a histogram
df['Age'].plot(kind='hist', bins=5, title='Age Distribution')
plt.xlabel('Age')
plt.show()

# Creating a box plot
df['Salary'].plot(kind='box', title='Salary Distribution')
plt.ylabel('Salary')
plt.show()
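
A scatter plot shows a relationship visually; a correlation coefficient puts a number on it. As a small complement to the plots above, this sketch computes the Pearson correlation between age and salary for the same four rows. The names `df_num` and `r` are illustrative.

```python
import pandas as pd

df_num = pd.DataFrame({
    'Age': [25, 30, 35, 40],
    'Salary': [70000, 80000, 90000, 100000]
})

# Pearson correlation ranges from -1 to 1; Series.corr uses it by default
r = df_num['Age'].corr(df_num['Salary'])
print(f"Pearson correlation: {r:.2f}")
```

The toy data is perfectly linear, so the coefficient is at its maximum; real datasets rarely behave this neatly.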