**What is data analysis? Understanding why it's important**


**ans:**
At its heart, data analysis is the process of cleaning, transforming, and modeling data to discover useful information and support decision-making.

Why it’s important:

- Identifying Patterns: Seeing trends that aren't obvious in raw numbers.
- Informed Decisions: Moving from "gut feelings" to evidence-based strategies.
- Predictive Power: Using historical data to guess what might happen next.

**Pandas library: Excel-like operations in Python**

**ans:**
Pandas is a Python library designed for manipulating and analyzing tabular data. Where Excel organizes data into workbooks and sheets, Pandas uses a structure called a DataFrame.


| Feature | Excel | Pandas |
| --- | --- | --- |
| Data container | Spreadsheet (.xlsx) | DataFrame |
| Rows | Numbered (1, 2, 3...) | Index (0, 1, 2... by default) |
| Columns | Lettered (A, B, C...) | Column names |
| Automation | Macros/VBA | Python scripts (reusable) |
| Data size | Limited to 1,048,576 rows per sheet | Limited mainly by computer memory |
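
To make the comparison concrete, here is a minimal sketch that builds a small DataFrame by hand. The variable name `df_demo` and the values are made up for illustration; they are not part of the dataset used later.

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column name,
# much like a header cell in the first row of a spreadsheet.
df_demo = pd.DataFrame({
    "Product": ["Phone", "Laptop", "Headphones"],
    "Quantity": [2, 1, 4],
})

print(df_demo)                     # rows are labelled 0, 1, 2 by default
print(list(df_demo.columns))       # column names instead of letters A, B, C
print(df_demo.loc[0, "Product"])   # look up a cell by row label and column name
```

Where Excel addresses a cell as A1, Pandas addresses it by a row label and a column name.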

**Reading data from CSV files - like opening Excel files**

**ans**

In Excel, you double-click a file to open it. In Python, you "read" it into a variable.
A CSV (Comma Separated Values) file is the most common format because it's lightweight and works everywhere.


```python
import pandas as pd

# This is like opening a file in Excel
df = pd.read_csv('your_data.csv')
```
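
If you do not have a CSV file at hand, the sketch below runs the same call on an in-memory text buffer. Here `io.StringIO` stands in for a file on disk, and the sample rows and the name `df_csv` are assumptions made for the example.

```python
import io
import pandas as pd

# Sample CSV text standing in for a file on disk (illustrative values).
csv_text = """Product,Quantity,Price
Phone,2,21746
Laptop,8,39835
"""

# read_csv accepts a file path or any file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv)
print(df_csv.dtypes)  # column types are inferred from the text
```

The first line of the CSV becomes the column names, and each following line becomes one row.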

**Exploring data: seeing basic information about your dataset**

**ans**

Before you start analyzing, you need to see what you’re working with. Instead of scrolling through thousands of rows, you use quick commands:

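- df.head(): Shows the first 5 rows by default (the "preview"); pass a number, such as df.head(10), to see more.
- df.info(): Lists each column's data type (numbers vs. text) and its non-null count, which reveals whether any data is missing.
- df.shape: Gives the number of rows and columns as a pair. It is an attribute, so it takes no parentheses.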
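
The three commands above can be tried on a tiny table. This is only a sketch: the `marks` DataFrame and its values are invented, with one missing entry so that `info()` has something to report.

```python
import pandas as pd

# Made-up student marks, including one missing value (None).
marks = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dina", "Eli", "Fay"],
    "Marks": [82, 67, None, 91, 74, 58],
})

print(marks.head())    # first 5 of the 6 rows
print(marks.head(2))   # pass a number to change how many rows you see
marks.info()           # dtypes plus non-null counts (Marks has 5 non-null)
print(marks.shape)     # (rows, columns); an attribute, so no parentheses
```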

**Data cleaning: fixing missing or wrong values**

**ans**

Data is rarely perfect. In Excel, you might manually delete blank cells; in Pandas, you use code to handle them consistently:

- Handling Missing Values: You can either drop rows with missing data or fill them with a value (like the column average).
- Fixing Data Types: Ensuring a "Price" column is treated as a number, not text.
- Removing Duplicates: One command can scan millions of rows for repeats.
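
The three cleaning steps above could look roughly like this. The `raw` table is invented for the example (prices stored as text, one missing price, one repeated row), and whether to drop or fill missing values depends on your data.

```python
import pandas as pd

# Invented messy data: prices stored as text, a missing price, a repeated row.
raw = pd.DataFrame({
    "Product": ["Phone", "Laptop", "Laptop", "Headphones"],
    "Price": ["100", "250", "250", None],
})

# Fix the data type: convert text to numbers (the missing entry becomes NaN).
raw["Price"] = pd.to_numeric(raw["Price"])

# Option A: drop rows that contain any missing value.
dropped = raw.dropna()

# Option B: fill missing prices with the column average.
filled = raw.fillna({"Price": raw["Price"].mean()})

# Remove exact duplicate rows, keeping the first occurrence.
deduped = filled.drop_duplicates()

print(dropped)
print(deduped)
```

Each call returns a new DataFrame, so the earlier versions remain available for comparison.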


**Simple statistics: average, maximum, minimum values**

**ans**

Once the data is clean, you can extract insights instantly. Pandas can calculate statistics across the entire dataset at once:

df['Column'].mean(): The average.

df['Column'].max(): The highest value.

df['Column'].min(): The lowest value.

df.describe(): A "magic" command that gives you a summary of the count, mean, standard deviation, minimum, maximum, and quartiles (25%, 50%, 75%) for every numerical column.
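
A short sketch of these calls on a made-up column, small enough to check by hand. The `scores` DataFrame is hypothetical.

```python
import pandas as pd

# Hypothetical marks column used only for illustration.
scores = pd.DataFrame({"Marks": [50, 60, 70, 80, 90]})

print(scores["Marks"].mean())  # 70.0
print(scores["Marks"].max())   # 90
print(scores["Marks"].min())   # 50

summary = scores.describe()    # count, mean, std, min, quartiles, max
print(summary)
```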

# Hands-On Practice:

In [5]:
## Install the pandas library using a simple pip command

!pip install pandas



In [None]:
## Download a simple dataset (like student marks or sales data)

# sales_data.csv is already present in the working directory

In [6]:
## Load the dataset and display the first few rows
import pandas as pd

# 1. Load the dataset into a variable called 'df' (short for DataFrame)
df = pd.read_csv('sales_data.csv')

# 2. Display the first few rows
print("Top rows of the dataset:")
print(df.head())

Top rows of the dataset:
         Date     Product  Quantity  Price Customer_ID Region  Total_Sales
0  2024-01-01       Phone         7  37300     CUST001   East       261100
1  2024-01-02  Headphones         4  15406     CUST002  North        61624
2  2024-01-03       Phone         2  21746     CUST003   West        43492
3  2024-01-04  Headphones         1  30895     CUST004   East        30895
4  2024-01-05      Laptop         8  39835     CUST005  North       318680
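
A quick sanity check at this point is to confirm that the columns agree with each other. The sketch below assumes that Total_Sales is meant to equal Quantity × Price, which the source does not state explicitly, and retypes the five preview rows so it runs without the CSV file.

```python
import pandas as pd

# The five rows printed above, retyped by hand so this cell is self-contained.
preview = pd.DataFrame({
    "Quantity": [7, 4, 2, 1, 8],
    "Price": [37300, 15406, 21746, 30895, 39835],
    "Total_Sales": [261100, 61624, 43492, 30895, 318680],
})

# Assumption: Total_Sales should equal Quantity * Price.
expected = preview["Quantity"] * preview["Price"]
mismatches = preview[expected != preview["Total_Sales"]]

print(len(mismatches))  # 0 for these five rows
```

On the full dataset, the same comparison could be run on `df` directly; any rows that appear in `mismatches` would be candidates for cleaning.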


In [9]:
# Check basic information: how many rows, columns, data types
df.columns

Index(['Date', 'Product', 'Quantity', 'Price', 'Customer_ID', 'Region',
       'Total_Sales'],
      dtype='object')

In [10]:
df.shape

(100, 7)

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Date         100 non-null    object
 1   Product      100 non-null    object
 2   Quantity     100 non-null    int64 
 3   Price        100 non-null    int64 
 4   Customer_ID  100 non-null    object
 5   Region       100 non-null    object
 6   Total_Sales  100 non-null    int64 
dtypes: int64(3), object(4)
memory usage: 5.6+ KB
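
The output above lists Date as `object`, which means the dates were read as plain text. A minimal sketch of one way to fix that data type is shown below. The `dates` table retypes a few values from the preview, and the format string is an assumption based on the YYYY-MM-DD pattern seen there.

```python
import pandas as pd

# A few Date values from the preview, retyped so the cell is self-contained.
dates = pd.DataFrame({"Date": ["2024-01-01", "2024-01-02", "2024-01-03"]})

# Convert text to real timestamps; the format is assumed from the preview.
dates["Date"] = pd.to_datetime(dates["Date"], format="%Y-%m-%d")

print(dates.dtypes)                 # Date is now a datetime64 column
print(dates["Date"].dt.day_name())  # Monday, Tuesday, Wednesday
```

Once the column holds real dates, date-aware operations such as extracting the weekday or month become available through the `.dt` accessor.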


In [12]:
# Find missing values: count the nulls in each column
# (all zeros means there is nothing to handle)
df.isnull().sum()

Date           0
Product        0
Quantity       0
Price          0
Customer_ID    0
Region         0
Total_Sales    0
dtype: int64

In [13]:
# Calculate average, highest, lowest values for numerical columns
df.describe()  # Get a summary of all numeric columns

       Quantity         Price    Total_Sales
count     100.0         100.0          100.0
mean       4.78      25808.51      123650.48
std    2.588163  13917.630242  100161.085275
min         1.0        1308.0         6540.0
25%        2.75      14965.25        39517.5
50%         5.0       24192.0        97955.5
75%         7.0      38682.25       175792.5
max         9.0       49930.0       373932.0


In [17]:
# Select numerical columns (automatically filters for int and float types)
numerical_cols = df.select_dtypes(include=['number'])

# Calculate average, highest (max), and lowest (min) values
stats = numerical_cols.agg(['mean', 'max', 'min']).transpose()

# Rename columns for clarity
stats.columns = ['Average', 'Highest', 'Lowest']

# Print the results
print(stats)

               Average   Highest  Lowest
Quantity          4.78       9.0     1.0
Price         25808.51   49930.0  1308.0
Total_Sales  123650.48  373932.0  6540.0
