**###Theory Concepts:**

**What is data analysis? Understanding why it's important**

In [None]:
ans: Data analysis is the process of cleaning, transforming, and interpreting raw data to uncover useful insights, patterns, and trends that support informed decision-making and problem-solving. It is important because it enables organizations to make smarter decisions based on facts rather than assumptions, understand customer behaviors and market trends, evaluate performance, and manage risks effectively.

What is Data Analysis?
Data analysis involves examining raw data through cleaning, organizing, and applying statistical or computational techniques to extract meaningful information. This process helps to identify trends, relationships, anomalies, and actionable insights from datasets.

Why Data Analysis is Important
Informed Decision-Making: Analyzing past and present data lets organizations make choices backed by evidence, which improves business outcomes.

Business Intelligence: It helps companies understand their customers, markets, and operational efficiency to stay competitive.

Risk Management: It helps predict potential problems and mitigate risks before they escalate.

Performance Evaluation: It monitors processes and strategies to identify what works and what needs improvement.

Innovation and Problem Solving: Data-derived insights encourage innovation and help address underlying problems.

Benefits Across Fields
Data analysis is crucial not only in business but also in healthcare, education, and public policy for driving informed strategies and optimizing performance.

This comprehensive role highlights why data analysis is a foundational skill for modern organizations aiming to leverage their data assets for competitive advantage and operational excellence.



**Pandas library: Excel-like operations in Python**

In [None]:
ans:

🐼 Pandas Library: Excel-like Operations in Python

Pandas is a powerful Python library used for working with tabular data (rows & columns), just like you do in Microsoft Excel.

It allows you to:

Load data

Clean data

Analyze data

Filter and sort

Perform calculations

Create pivot tables

Merge sheets

All using Python code.

📌 1. Reading and Writing Excel/CSV Files

Excel Equivalent: Opening and saving files

import pandas as pd

df = pd.read_excel("data.xlsx")   # Read Excel file
df.to_excel("output.xlsx")        # Save to Excel

📌 2. Viewing Data

Excel Equivalent: Looking at rows

df.head()   # shows first 5 rows
df.tail()   # shows last 5 rows
df.shape    # number of rows & columns
df.info()   # column details

📌 3. Selecting Columns (like selecting Excel columns)
df["Name"]          # single column
df[["Name","Age"]]  # multiple columns

📌 4. Filtering Data (like using filters in Excel)
df[df["Age"] > 25]
df[df["City"] == "Hyderabad"]
df[(df["Age"] > 20) & (df["Gender"] == "F")]

📌 5. Sorting Data (like Sort A→Z or Z→A)
df.sort_values("Salary")
df.sort_values("Salary", ascending=False)

📌 6. Adding New Columns (like formulas in Excel)
df["Total"] = df["Quantity"] * df["Price"]
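
The selecting, filtering, sorting, and new-column steps can be tried without any file. The sketch below builds a small table in memory; the names and numbers are hypothetical sample data invented for illustration, not taken from a real dataset.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
df = pd.DataFrame({
    "Name": ["Asha", "Kiran", "Meena", "Vikram"],
    "Age": [24, 31, 28, 19],
    "City": ["Hyderabad", "Chennai", "Hyderabad", "Pune"],
    "Quantity": [2, 5, 1, 3],
    "Price": [100.0, 30.0, 250.0, 60.0],
})

# Selecting: one column gives a Series, a list of columns gives a DataFrame
names = df["Name"]
subset = df[["Name", "Age"]]

# Filtering: each condition needs its own parentheses when combined with &
over_25 = df[df["Age"] > 25]
hyd_over_20 = df[(df["Age"] > 20) & (df["City"] == "Hyderabad")]

# New column computed row by row, like an Excel formula filled down
df["Total"] = df["Quantity"] * df["Price"]

# Sorting returns a new DataFrame, largest Total first
by_total = df.sort_values("Total", ascending=False)
print(by_total[["Name", "Total"]])
```

Filters and `sort_values` return new DataFrames, so `df` itself keeps its original row order unless the result is assigned back to it.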

📌 7. Handling Missing Values

Excel Equivalent: Cleaning blank cells

df.dropna()              # remove rows with empty values
df.fillna(0)             # replace empty values with 0

📌 8. Grouping Data (like Pivot Tables / Group By in Excel)
df.groupby("Department")["Salary"].mean()
df.groupby("City")["Sales"].sum()

📌 9. Merging & Joining (like VLOOKUP or combining sheets)
pd.merge(df1, df2, on="ID")

📌 10. Pivot Tables (just like Excel Pivot Table)
df.pivot_table(values="Sales", index="City", aggfunc="sum")
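
To see what grouping, merging, and pivoting return, here is a minimal sketch on two small tables. The `sales` and `customers` tables and all their values are hypothetical, made up only to show the shape of each result.

```python
import pandas as pd

# Hypothetical tables, invented for illustration only
sales = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "City": ["Hyderabad", "Chennai", "Hyderabad", "Pune"],
    "Sales": [200, 150, 250, 180],
})
customers = pd.DataFrame({
    "ID": [1, 2, 3, 5],
    "Name": ["Asha", "Kiran", "Meena", "Ravi"],
})

# Group by: one total per city, returned as a Series indexed by City
city_totals = sales.groupby("City")["Sales"].sum()

# Merge: the default is an inner join, so only IDs present in both tables remain
merged = pd.merge(sales, customers, on="ID")

# how="left" keeps every row of sales; unmatched names become NaN (closer to VLOOKUP)
left = pd.merge(sales, customers, on="ID", how="left")

# Pivot table: same totals as the group by, returned as a DataFrame
pivot = sales.pivot_table(values="Sales", index="City", aggfunc="sum")

print(city_totals)
print(left)
```

The inner join silently drops ID 4 (no matching customer), which is worth checking for whenever the row count shrinks after a merge.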

⭐ Why Pandas is Better Than Excel

Can handle millions of rows (an Excel sheet is capped at 1,048,576 rows; pandas is limited mainly by available memory)

Automates repetitive tasks

Works great for data cleaning

Integrates with Python, SQL, ML

Saves time for analysts & data scientists

**Reading data from CSV files - like opening Excel files**

In [None]:
ans:

A CSV (Comma-Separated Values) file is a plain-text table: each line is one row, and commas separate the values that Excel would place in separate cells.

Pandas makes it extremely easy to open CSV files.

✅ 1. Import Pandas
import pandas as pd

✅ 2. Reading a CSV File

This is like File → Open → selecting the file in MS Excel.

df = pd.read_csv("data.csv")


✔ data.csv = your file name
✔ df = DataFrame (table structure like Excel)

🔍 3. Viewing the Data

Just like checking the first rows in Excel:

df.head()     # first 5 rows
df.tail()     # last 5 rows
df.shape      # rows, columns count
df.info()     # overview of columns

📌 Common Options While Reading CSV
1️⃣ If the CSV has a different separator

Example: values separated by ; instead of ,

df = pd.read_csv("file.csv", sep=";")

2️⃣ If the file has no headers
df = pd.read_csv("file.csv", header=None)

3️⃣ If the file has a column that should be the index

Like Excel row labels:

df = pd.read_csv("file.csv", index_col=0)

4️⃣ Reading only selected columns
df = pd.read_csv("file.csv", usecols=["Name", "Salary"])

5️⃣ Handling missing values while reading
df = pd.read_csv("file.csv", na_values=["?", "NA", "--"])
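
Several of these options can be combined in one call. The sketch below is a minimal illustration: the CSV text is hypothetical, and `io.StringIO` stands in for a file on disk so the example runs without any external file.

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO lets read_csv treat the string like a file
raw = (
    "Name;Salary;City;Notes\n"
    "Asha;50000;Pune;a\n"
    "Kiran;?;--;b\n"
    "Meena;62000;Chennai;c\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",                              # values are separated by semicolons
    usecols=["Name", "Salary", "City"],   # skip the Notes column
    na_values=["?", "NA", "--"],          # treat these markers as missing
)

print(df)
print(df.isnull().sum())
```

Because "?" is read as missing instead of as text, the Salary column is parsed as numbers and is ready for calculations.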

🧾 Reading Excel vs Reading CSV

| Task | Excel File | CSV File |
|---|---|---|
| Read | pd.read_excel("file.xlsx") | pd.read_csv("file.csv") |
| Write | df.to_excel("output.xlsx") | df.to_csv("output.csv") |
| Structure | Multiple sheets | Only one sheet |
| Size | Larger | Smaller & faster |

⭐ Simple Example

Your file: students.csv

Name,Age,Marks
Ravi,20,85
Priya,21,90
Arun,19,78


Python code:

import pandas as pd

df = pd.read_csv("students.csv")
print(df)


Output:

    Name  Age  Marks
0   Ravi   20     85
1  Priya   21     90
2   Arun   19     78


Just like you opened it in Excel!

**Exploring data: seeing basic information about your dataset**

In [None]:
ans:

🔍 Exploring Data in Pandas

(Seeing basic information about your dataset)

Before analyzing data, the first step is to understand what your dataset contains.
Pandas provides several useful functions to quickly explore your data.

✅ 1. View First Few Rows

Just like scrolling to the top in Excel.

df.head()


Shows the first 5 rows.

To show the first 10 rows:

df.head(10)

✅ 2. View Last Few Rows

Like scrolling to the bottom of Excel.

df.tail()

✅ 3. Check Dataset Shape (Rows & Columns)

Like checking how big your Excel sheet is.

df.shape


Output example:

(1000, 6)


➡ 1000 rows
➡ 6 columns

✅ 4. Get Summary of Dataset

Shows column names, data types, non-null counts (which reveal missing values), and memory usage.

df.info()


This is like seeing Column Properties in Excel.

✅ 5. Describe Numerical Columns

Gives statistics:

Count

Mean

Min

Max

Standard deviation

Quartiles

df.describe()


For both numbers + strings:

df.describe(include="all")

✅ 6. List Column Names
df.columns

✅ 7. Check Number of Missing Values
df.isnull().sum()

✅ 8. View Sample Rows Randomly

Good for quick inspection.

df.sample(5)

📌 Summary Table

| Task | Excel Equivalent | Pandas Command |
|---|---|---|
| See top rows | Scroll up | df.head() |
| See bottom rows | Scroll down | df.tail() |
| Size of sheet | Count rows/columns | df.shape |
| Column info | Column properties | df.info() |
| Basic statistics | Data analysis tools | df.describe() |
| Column names | Header row | df.columns |
| Missing values | Filter empty cells | df.isnull().sum() |

⭐ Simple Example
import pandas as pd

df = pd.read_csv("sales.csv")

print(df.head())
df.info()              # info() prints its report itself and returns None, so no print() is needed
print(df.describe())


This gives you a quick overview of your dataset before detailed analysis.

**Data cleaning: fixing missing or wrong values**

In [None]:
ans:

🧹 Data Cleaning: Fixing Missing or Wrong Values

Data is rarely perfect. Many datasets contain:

Missing values

Wrong/inaccurate entries

Duplicates

Inconsistent formatting

Data cleaning is the process of fixing these issues so that your analysis becomes accurate.

✅ 1. Handling Missing Values

Missing values appear as:
NaN, empty cells, None, ?, --

🔹 A. Find Missing Values
df.isnull().sum()


Shows how many missing values are in each column.

🔹 B. Remove Rows with Missing Values

Like deleting incomplete rows in Excel:

df.dropna()


To update the original dataframe:

df.dropna(inplace=True)

🔹 C. Fill Missing Values

Instead of deleting, you can replace missing values.

1️⃣ Fill with a constant (e.g., 0)
df.fillna(0)

2️⃣ Fill with mean of column

Useful for numerical data:

# Assign the result back; inplace=True on a selected column is deprecated in recent pandas
df["Age"] = df["Age"].fillna(df["Age"].mean())

3️⃣ Fill with median
df["Salary"] = df["Salary"].fillna(df["Salary"].median())

4️⃣ Fill with mode (most repeated value)

Good for categorical data:

df["City"] = df["City"].fillna(df["City"].mode()[0])
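
A minimal sketch of the mean, median, and mode fills on a tiny table, so the filled values can be checked by hand. The data is hypothetical, and the fills are written in assignment form (`column = column.fillna(...)`), which does not rely on modifying a selected column in place.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap per column, invented for illustration only
df = pd.DataFrame({
    "Age": [20.0, np.nan, 30.0, 40.0],
    "Salary": [30000.0, 50000.0, np.nan, 90000.0],
    "City": ["Pune", None, "Pune", "Chennai"],
})

# Mean of 20, 30, 40 is 30
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Median of 30000, 50000, 90000 is 50000
df["Salary"] = df["Salary"].fillna(df["Salary"].median())

# mode() returns a Series (there can be ties), so [0] picks the first most common value
df["City"] = df["City"].fillna(df["City"].mode()[0])

print(df)
```

The mean and median are computed from the non-missing values only, because pandas skips NaN in these statistics by default.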

✅ 2. Fixing Wrong / Invalid Values

Sometimes values are incorrect:

Age = -5

Salary = 1,00,00,000 (too high)

City name spelled wrong: "Hyderbad" instead of "Hyderabad"

🔹 A. Replace wrong values
df["City"] = df["City"].replace("Hyderbad", "Hyderabad")

🔹 B. Remove impossible values

Example: Remove rows where age < 0

df = df[df["Age"] >= 0]

🔹 C. Fix outliers (very high/low values)

Replace salary > 5,00,000 with the median:

median_salary = df["Salary"].median()
df.loc[df["Salary"] > 500000, "Salary"] = median_salary
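
The three fixes (A, B, C) can be applied in sequence. The sketch below uses hypothetical rows that each contain one of the problems listed above; the 500000 threshold is the same illustrative cutoff, not a general rule for detecting outliers.

```python
import pandas as pd

# Hypothetical rows: a misspelled city, a negative age, and an extreme salary
df = pd.DataFrame({
    "Age": [25, -5, 40, 33],
    "Salary": [40000.0, 52000.0, 10000000.0, 61000.0],
    "City": ["Hyderabad", "Pune", "Hyderbad", "Chennai"],
})

# A. Replace the misspelled value
df["City"] = df["City"].replace("Hyderbad", "Hyderabad")

# B. Keep only rows with a possible age; .copy() makes the result independent of the original
df = df[df["Age"] >= 0].copy()

# C. Replace salaries above the cutoff with the median of the remaining rows
median_salary = df["Salary"].median()
df.loc[df["Salary"] > 500000, "Salary"] = median_salary

print(df)
```

The median is used rather than the mean because a single extreme value barely moves it: here the median is 61000 even though one salary is 10000000.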

✅ 3. Removing Duplicate Rows

Just like Excel's “Remove Duplicates”:

Check duplicates:
df.duplicated().sum()

Remove duplicates:
df.drop_duplicates(inplace=True)

✅ 4. Fixing Incorrect Data Types

Example problems:

Age stored as text

Date stored as string

Convert column type
df["Age"] = df["Age"].astype(int)

Convert to datetime
df["Date"] = pd.to_datetime(df["Date"])
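
One caveat: `astype(int)` raises an error if the column contains text that is not a number, or missing values. A minimal sketch of a more forgiving conversion with `pd.to_numeric(..., errors="coerce")` follows; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical data: ages stored as text (one entry is not a number), dates stored as strings
df = pd.DataFrame({
    "Age": ["25", "31", "unknown"],
    "Date": ["2024-01-15", "2024-02-01", "2024-03-10"],
})

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")

# Strings become real datetime values, which unlocks the .dt accessor
df["Date"] = pd.to_datetime(df["Date"])

print(df.dtypes)
print(df["Date"].dt.month)
```

The coerced NaN values can then be handled with the missing-value techniques from section 1, after which `astype(int)` is safe.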

🎯 Summary of Data Cleaning Commands

| Task | Command |
|---|---|
| Check missing values | df.isnull().sum() |
| Drop missing rows | df.dropna() |
| Fill missing values | df.fillna(value) |
| Fix wrong values | df.replace(old, new) |
| Remove duplicates | df.drop_duplicates() |
| Convert data type | df.astype(dtype) |
| Convert to date | pd.to_datetime() |

⭐ Example: Cleaning a Messy Dataset
import pandas as pd

df = pd.read_csv("employees.csv")

# Fix missing ages (assign back rather than using inplace=True on a selected column)
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Remove impossible salaries
df = df[df["Salary"] > 0]

# Correct spelling mistake
df["City"] = df["City"].replace("Hyderbad", "Hyderabad")

# Remove duplicate rows
df.drop_duplicates(inplace=True)

**Simple statistics: average, maximum, minimum values**

In [None]:
ans:


Pandas computes simple statistics like average (mean), maximum, and minimum on DataFrames or Series using built-in methods, ideal for quick insights after cleaning data.

Basic Statistics on Columns
Apply to numeric columns for single values.

Average (Mean): df['Score'].mean() calculates the arithmetic mean, ignoring NaNs by default.

Maximum: df['Score'].max() finds the highest value.

Minimum: df['Score'].min() finds the lowest value.

All at once: df['Score'].agg(['mean', 'max', 'min']) returns a summary Series.

Statistics Across Entire DataFrame
Get stats for all numeric columns simultaneously.

Comprehensive summary: df.describe() shows count, mean, std, min, 25%, 50%, 75%, and max for all numerics.

Custom aggregation: df.agg({'Score': ['mean', 'max', 'min'], 'Age': 'mean'}) targets specific columns and stats.

On a student grades dataset, compute grades['Score'].mean() for class average, .max() for top score, and .min() for lowest after cleaning.
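
As a rough illustration of these methods, the sketch below rebuilds the five-row student marks table shown in the hands-on section of this notebook. It has Maths, Science, and English columns rather than a single Score column, so the column names differ from the snippets above.

```python
import pandas as pd

# Values copied from the student marks dataset printed in the hands-on section
grades = pd.DataFrame({
    "Student": ["Ravi", "Priya", "Rahul", "Sneha", "Arjun"],
    "Maths": [85, 90, 78, 92, 88],
    "Science": [80, 95, 70, 89, 84],
    "English": [75, 88, 82, 91, 79],
})

# Mean, max, and min of one column in a single call
maths_summary = grades["Maths"].agg(["mean", "max", "min"])

# Different statistics per column; cells that were not requested come back as NaN
per_subject = grades.agg({"Maths": ["mean", "max", "min"], "Science": ["mean"]})

# idxmax() gives the row label of the highest value, used here to look up the name
top_student = grades.loc[grades["Maths"].idxmax(), "Student"]

print(maths_summary)
print(per_subject)
print("Top in Maths:", top_student)
```

`max()` reports the highest value only; `idxmax()` is the companion that says which row it came from.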

**###Hands-On Practice:**

**Install pandas library using simple pip command**

In [1]:
pip install pandas




In [2]:
import pandas as pd
print(pd.__version__)

2.1.4


**Download a simple dataset (like student marks or sales data)**

In [None]:
The dataset used here is a small student marks CSV file, loaded in the next cell as student_marks_dataset.csv.

**Load the dataset and display first few rows**

In [4]:
df = pd.read_csv('student_marks_dataset.csv') 
print(df.head(1))

  Student  Maths  Science  English
0    Ravi     85       80       75


**Check basic information: how many rows, columns, data types**

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Student  5 non-null      object
 1   Maths    5 non-null      int64 
 2   Science  5 non-null      int64 
 3   English  5 non-null      int64 
dtypes: int64(3), object(1)
memory usage: 292.0+ bytes


In [7]:
rows, columns = df.shape
print(f"Rows: {rows}, Columns: {columns}")

Rows: 5, Columns: 4


**Find and handle missing values in the dataset**

In [8]:
import numpy as np


print("Missing values per column:")
print(df.isnull().sum())

# Total missing values
print(f"\nTotal missing: {df.isnull().sum().sum()}")

# Handle missing values - fill with mean for numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Drop rows with remaining missing values (if any)
df = df.dropna()

print("\nDataset after cleaning:")
print(df.shape)
print(df.head())

Missing values per column:
Student    0
Maths      0
Science    0
English    0
dtype: int64

Total missing: 0

Dataset after cleaning:
(5, 4)
  Student  Maths  Science  English
0    Ravi     85       80       75
1   Priya     90       95       88
2   Rahul     78       70       82
3   Sneha     92       89       91
4   Arjun     88       84       79


**Calculate average, highest, lowest values for numerical columns**

In [9]:
print("Statistics for numerical columns:")
print(df.describe())

# Or specific stats only
print("\nMean values:")
print(df.mean(numeric_only=True))
print("\nMax values:")
print(df.max(numeric_only=True))
print("\nMin values:")
print(df.min(numeric_only=True))

Statistics for numerical columns:
           Maths    Science    English
count   5.000000   5.000000   5.000000
mean   86.600000  83.600000  83.000000
std     5.458938   9.449868   6.519202
min    78.000000  70.000000  75.000000
25%    85.000000  80.000000  79.000000
50%    88.000000  84.000000  82.000000
75%    90.000000  89.000000  88.000000
max    92.000000  95.000000  91.000000

Mean values:
Maths      86.6
Science    83.6
English    83.0
dtype: float64

Max values:
Maths      92
Science    95
English    91
dtype: int64

Min values:
Maths      78
Science    70
English    75
dtype: int64


**🎯 Project:**

**Analyze a Simple Sales Dataset to find: total sales, best-selling product, and create a basic report**
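
The Store-Transactions.csv file used in the cell below holds customer subscription records and has no sales columns, so total sales and best-selling product cannot be computed from it. For a dataset that does have them, the three goals could be sketched roughly as follows. The `Product`, `Quantity`, and `Price` columns and every value here are hypothetical, invented only to show the calculation.

```python
import pandas as pd

# Hypothetical sales records, invented for illustration only
sales = pd.DataFrame({
    "Product": ["Pen", "Notebook", "Pen", "Bag", "Notebook", "Pen"],
    "Quantity": [10, 4, 6, 1, 3, 8],
    "Price": [10.0, 50.0, 10.0, 600.0, 50.0, 10.0],
})

# Revenue per transaction, then the overall total
sales["Revenue"] = sales["Quantity"] * sales["Price"]
total_sales = sales["Revenue"].sum()

# Best-selling product by units sold
units_by_product = sales.groupby("Product")["Quantity"].sum()
best_seller = units_by_product.idxmax()

# Revenue per product, highest first
revenue_by_product = (
    sales.groupby("Product")["Revenue"].sum().sort_values(ascending=False)
)

# Basic report
print("=== SALES REPORT ===")
print(f"Total sales: {total_sales:.2f}")
print(f"Best-selling product (units): {best_seller}")
print("\nRevenue by product:")
print(revenue_by_product)
```

In this sample the best seller by units (Pen) is not the top earner by revenue (Bag), so a report should state which definition of "best-selling" it uses.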

In [25]:
# Load the attached Store-Transactions.csv dataset
df = pd.read_csv('Store-Transactions.csv')

# Basic dataset info
print("Dataset Info:")
print(f"Shape: {df.shape}")
print(df.head())

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

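# Clean data (drop rows with any missing values)
# .copy() gives an independent DataFrame, so adding a column later does not raise SettingWithCopyWarning
df_clean = df.dropna().copy()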

print(f"\nCleaned dataset shape: {df_clean.shape}")

# Basic Report (Customer subscription analysis since no sales columns)
print("\n=== STORE TRANSACTIONS REPORT ===")
print(f"Total Customers: {len(df_clean)}")
print(f"Date Range: {df_clean['Subscription Date'].min()} to {df_clean['Subscription Date'].max()}")

# Top countries by customer count
print("\nTop 5 Countries by Customers:")
print(df_clean['Country'].value_counts().head())

# Most common company names
print("\nTop 5 Companies by Subscriptions:")
print(df_clean['Company'].value_counts().head())

# Summary by year
df_clean['Subscription_Year'] = pd.to_datetime(df_clean['Subscription Date']).dt.year
print("\nSubscriptions by Year:")
print(df_clean['Subscription_Year'].value_counts().sort_index())


Dataset Info:
Shape: (100, 12)
   Index      Customer Id First Name Last Name  \
0      1  DD37Cf93aecA6Dc     Sheryl    Baxter   
1      2  1Ef7b82A4CAAD10    Preston    Lozano   
2      3  6F94879bDAfE5a6        Roy     Berry   
3      4  5Cef8BFA16c5e3c      Linda     Olsen   
4      5  053d585Ab6b3159     Joanna    Bender   

                           Company               City  \
0                  Rasmussen Group       East Leonard   
1                      Vega-Gentry  East Jimmychester   
2                    Murillo-Perry      Isabelborough   
3  Dominguez, Mcmillan and Donovan         Bensonview   
4         Martin, Lang and Andrade     West Priscilla   

                      Country                 Phone 1                Phone 2  \
0                       Chile            229.077.5154       397.884.0519x718   
1                    Djibouti              5153435776       686-620-1820x944   
2         Antigua and Barbuda         +1-539-402-0259    (496)978-3969x58947   
3    