# A beginner-friendly notebook to learn EDA!

**Created by:** Vinaya Sangeeta Lahari Baswa  
Sophomore Student  @GITAM Deemed University - Visakhapatnam   

[GitHub](https://github.com/bvslahari007)
| [LinkedIn](https://www.linkedin.com/in/vinaya-sangeeta-lahari-baswa-027892316/)
| [Kaggle](https://www.kaggle.com/bvslahari)

A note from your fellow learner!

_This notebook is not just a code dump. It’s my hands-on guide as I learned EDA from scratch using real data and real curiosity. If you’re a beginner like me, I hope this helps you too_ <3  

And yeah one more thing...

_This notebook is part of my personal learning journey in AI.  
Some of the markdowns or comments may feel casual or funny — that’s because I wrote them to help myself understand better._

_I’ve kept them in here on purpose so other beginners like me can understand and learn too!_ 

_So please feel free to treat this as a fun, beginner-level hands-on guide, not a polished textbook_


## EDA is the process of looking at your data before doing anything else.
It helps you understand what kind of data you have, what it looks like, and what it might be telling you. It’s about:

- Exploring the data
- Understanding its patterns
- Finding errors or surprises
- Visualizing it in different ways

| Tool           | Why It Matters                                           |
| -------------- | -------------------------------------------------------- |
| **Pandas**     | Foundation for handling dataframes                |
| **Seaborn**    | Easy, beautiful plots                        |
| **Matplotlib** | Fine-tuning plots                   |
| **NumPy**      | Under the hood for numbers & arrays  |

## How to Add a Dataset in a Kaggle Notebook

If you're new to Kaggle, here’s how to add a dataset to your notebook:

### 1. Open Your Notebook
Start a new notebook or open an existing one from your Kaggle account.

### 2. Click "Add Data" on the Right Sidebar
On the right side of the notebook editor, click the "Add Data" button.  
This opens a search window where you can find datasets.

### 3. Search for the Dataset
Use the search bar to type the name of the dataset you want to add (e.g., Titanic, Netflix Movies, Student Performance, etc.).

### 4. Add the Dataset
Once you find the dataset, click on it and then click the "Add" button.  
It will now be attached to your notebook.

### 5. Access the Dataset in Your Code
Kaggle puts every attached dataset under the `/kaggle/input` folder, so you can read a file like this:
```python
import pandas as pd

df = pd.read_csv("/kaggle/input/titanic/train.csv")
```
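If you're not sure what the exact path is, one way to check is to walk the input folder and print every file in it. This is a minimal sketch, assuming your datasets are attached under `/kaggle/input` (the path used later in this notebook). `list_input_files` is just a helper name I made up for illustration:

```python
import os

def list_input_files(root="/kaggle/input"):
    """Return the full path of every file found under `root`."""
    paths = []
    for dirname, _, filenames in os.walk(root):
        for filename in filenames:
            paths.append(os.path.join(dirname, filename))
    return sorted(paths)

# On Kaggle this prints paths you can copy into pd.read_csv(...).
# If the folder does not exist (e.g. on your own laptop), nothing is printed.
for path in list_input_files():
    print(path)
```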



#### Dataset I used in this notebook:  
[Click here to view the dataset](https://www.kaggle.com/competitions/titanic)

So, real talk.

I'm still figuring out what's "best" in the world of data science tools, but for now, I personally recommend using Kaggle Notebooks if you're learning Exploratory Data Analysis (EDA) for the first time.

Why? Because everything just works.

No installing pandas.  
No updating matplotlib.  
No random errors about missing dependencies.  
No crying at 1 AM because "pip install" betrayed you.

Kaggle gives you:
- A clean notebook interface  
- Most popular libraries (pandas, numpy, seaborn, matplotlib) pre-installed  
- Access to tons of public datasets  
- And zero setup drama

If you're learning for fun, for class, or to eventually get into AI or data science — this is a great place to start. You focus on learning, not fixing.

So yeah, this isn't expert advice, but it's real beginner advice — from someone who’s also learning and just wanted fewer headaches along the way.


## Understanding a Dataset  
`.head()`, `.info()`, and `.shape` help us understand the size and structure of the data.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("/kaggle/input/titanic/train.csv")

In [None]:
print(df.head())

In [None]:
print(df.shape)

In [None]:
df.info()  # info() prints its report by itself; wrapping it in print() also prints "None"

| Code            | What It Does                                                    |
| --------------- | --------------------------------------------------------------- |
| `df.shape`      | Tells you how many rows and columns (e.g., `(100, 5)`)          |
| `df.columns`    | Lists all the column names                                      |
| `df.info()`     | Shows types of data and how many non-null (non-missing) values  |
| `df.describe()` | Gives summary statistics (mean, std, min, max etc. for numbers) |
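
Here's a tiny sketch of the four calls from the table. The data is made up by me just for illustration (it's not from the Titanic file), so you can run it anywhere:

```python
import pandas as pd

# Made-up mini dataset, only for illustration
toy = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dee"],
    "Age": [22.0, 35.0, None, 58.0],   # one missing value on purpose
    "Fare": [7.25, 71.28, 8.05, 26.55],
})

print(toy.shape)          # (rows, columns)
print(list(toy.columns))  # column names
toy.info()                # dtypes + non-null counts (prints on its own)
print(toy.describe())     # stats for the numeric columns only
```

Notice how `info()` reports one fewer non-null value for `Age` than for the other columns. That's the missing value showing up.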

## Types of Data & Summary Statistics  

We check data types and use `.describe()` to get basic stats like mean, min, and max for each numeric column.


In [None]:
print(df.describe())

In [None]:
print(df['Age'].describe())
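
The sentence above mentions checking data types, so here's a small sketch of how that could look. The frame is made-up sample data, not the real Titanic file:

```python
import pandas as pd

# Made-up sample rows, only for illustration
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Pclass": [3, 1, 3],
})

print(toy.dtypes)  # one dtype per column

# Split the columns into numeric and non-numeric
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(other_cols)
```

By default `.describe()` only summarizes the numeric columns, which is why a text column like `Sex` wouldn't show up in it.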

## Value Counts (for Categorical Columns)
`value_counts()` shows how often each category appears (like how many males vs females).

In [None]:
df['Age'].value_counts()

In [None]:
df['Sex'].value_counts(normalize = True)

In [None]:
print(df["Age"].unique())

In [None]:
df.describe(include="all")

In [None]:
print(df["Fare"].describe())
print(df["Sex"].value_counts())
print(df["Embarked"].unique())
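
To see what `normalize=True` changes, here's a small sketch on a made-up column (again, not the real data):

```python
import pandas as pd

# Made-up values, only for illustration
sex = pd.Series(["male", "male", "male", "female"], name="Sex")

counts = sex.value_counts()                # raw counts per category
shares = sex.value_counts(normalize=True)  # same counts as fractions that sum to 1

print(counts)
print(shares)
```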

## Missing Values & Duplicates  
We identify and handle missing data and duplicates to clean the dataset before analysis.
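
The cells in this section deal with missing values. For duplicates, a common way to check is `duplicated()` together with `drop_duplicates()`. A minimal sketch on a made-up frame (not the Titanic data):

```python
import pandas as pd

# Made-up rows, only for illustration; the last row repeats the first
toy = pd.DataFrame({
    "Name": ["Asha", "Ben", "Asha"],
    "Age": [22, 35, 22],
})

n_dupes = toy.duplicated().sum()  # rows that exactly repeat an earlier row
deduped = toy.drop_duplicates()   # keeps the first copy of each row

print(n_dupes)
print(deduped)
```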



In [None]:
df.isnull().sum()

In [None]:
df["Age"] = df["Age"].fillna(df["Age"].mean())

In [None]:
df['Cabin'] = df["Cabin"].fillna("Unknown")

In [None]:
df['Embarked'] = df['Embarked'].fillna(df["Embarked"].mode()[0])

# mode() gives the most frequent value(s). It returns a Series, so we add [0] to take the first element.

| Code                       | What it means                                          |
| -------------------------- | ------------------------------------------------------ |
| `df["Embarked"].mode()`    | gives most frequent value(s) in the column             |
| `df["Embarked"].mode()[0]` | gives the first (main) most frequent value, like `'S'` |
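
Why does `mode()` return a Series at all? Because there can be a tie. A small sketch with made-up values:

```python
import pandas as pd

# Made-up values: "S" and "C" both appear twice, so there is a tie
ports = pd.Series(["S", "C", "S", "C", "Q", None])

modes = ports.mode()  # all most-frequent values, sorted; missing values are ignored
first = modes[0]      # what [0] picks when there is a tie

filled = ports.fillna(first)
print(modes.tolist())
print(filled.tolist())
```

So with a tie, `[0]` just takes whichever value comes first in sorted order. That's worth knowing before trusting it blindly.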



## Univariate Analysis (One Column at a Time)  
 _Univariate analysis = analyzing just one column_  

It helps us understand the distribution of a single column, like how many passengers were in each gender or age group.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
sns.countplot(x = 'Sex', data = df)
plt.title('Count of Passengers by Sex')
plt.show()
# A bar graph showing how many males vs females

In [None]:
sns.countplot(x="Embarked", data=df)
plt.title("Embarked Port Distribution")

In [None]:
#Optional Customization
sns.countplot(x="Embarked", data=df, palette="Set2")
plt.title("Embarked Port Distribution")

In [None]:
#bins = 30 ---> number of bars (more bins = finer detail)
#kde = True ---> adds a curved line to show distribution

sns.histplot(df["Age"], bins = 30, kde = True)
plt.title("Age Distribution of Passengers")
plt.show()

In [None]:
sns.boxplot(x = 'Age', data = df)
plt.title('Boxplot of Age')
plt.show()

The big box thingy represents the Interquartile Range (IQR), which is where the middle 50% of the data falls (from the 25% mark to the 75% mark).

This shows:

- Median (middle)

- Quartiles (25%, 75%)

- Outliers (dots beyond whiskers)
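
To make the box less mysterious, here's a sketch of the numbers behind it. I'm assuming the usual 1.5 × IQR rule for the whiskers, which is the default in seaborn and matplotlib. The ages are made up:

```python
import pandas as pd

# Made-up ages, only for illustration
ages = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = ages.quantile(0.25)  # first quartile (bottom edge of the box)
q3 = ages.quantile(0.75)  # third quartile (top edge of the box)
iqr = q3 - q1             # length of the box

lower = q1 - 1.5 * iqr    # values below this are drawn as outlier dots
upper = q3 + 1.5 * iqr    # values above this are drawn as outlier dots
outliers = ages[(ages < lower) | (ages > upper)]

print(q1, q3, iqr)
print(outliers.tolist())
```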


| What You Want           | Use This Code                     |
| ----------------------- | --------------------------------- |
| Count categories        | `sns.countplot(x="Sex", data=df)` |
| Distribution of numbers | `sns.histplot(df["Age"])`         |
| Spot outliers           | `sns.boxplot(x=df["Fare"])`       |


## Bivariate Analysis (2 Columns Together)  
It helps us compare two columns, like survival rate by gender or class.


In [None]:
sns.countplot(x = 'Sex', hue = 'Survived', data = df)
plt.title("Survival Count by Sex")
plt.show()

In [None]:
sns.boxplot(x = 'Survived', y = 'Age', data = df)
plt.title("Age Distribution by Survival")
plt.show()

In [None]:
sns.histplot(data=df, x="Age", hue="Survived", multiple="stack")

In [None]:
sns.scatterplot(x="Age", y="Fare", data=df)
plt.title("Age vs Fare Paid")

If Univariate was:  
“How many students got each grade?”  

Then Bivariate is:  
“Did boys score higher than girls?”  
“Do people who study more hours get better marks?”

| Type of Data               | Tool                            |
| -------------------------- | ------------------------------- |
| Categorical vs Categorical | `sns.countplot(x="A", hue="B")` |
| Categorical vs Numerical   | `sns.boxplot(x="A", y="B")`     |
| Numerical vs Numerical     | `sns.scatterplot(x="A", y="B")` |
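
The plots show counts, but the actual survival rate per group is a one-line `groupby`. A sketch on made-up rows (the column names match Titanic, the values don't):

```python
import pandas as pd

# Made-up rows, only for illustration
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Survived": [0, 0, 1, 1, 1, 0],
})

# Survived is 0/1, so the mean per group is the survival rate
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

# Categorical vs categorical as a table of counts
counts = pd.crosstab(toy["Sex"], toy["Survived"])

print(rate_by_sex)
print(counts)
```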

Boxplots and histograms help us compare how a numeric column (like age) is spread out across groups (like survived vs. not survived).


## Correlation & Heatmaps  
A correlation matrix shows how strongly numerical columns are related to each other, with values from -1 to +1.

A heatmap is a colorful visual of the correlation matrix that helps you quickly spot relationships.



In [None]:
df.corr(numeric_only = True)

This shows a table of how every numeric column is correlated with every other numeric column.
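
To get a feel for the -1 to +1 range, here's a sketch with made-up columns where the answer is easy to guess. By default `.corr()` uses the Pearson correlation:

```python
import pandas as pd

# Made-up numbers, only for illustration
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "double_x": [2, 4, 6, 8, 10],  # goes up exactly when x goes up
    "minus_x": [5, 4, 3, 2, 1],    # goes down exactly when x goes up
})

corr = toy.corr()
print(corr.round(2))
```

`x` vs `double_x` comes out at +1 (perfectly in step), and `x` vs `minus_x` at -1 (perfectly opposite). Real columns usually land somewhere in between.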

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

corr_matrix = df.corr(numeric_only=True)
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()


* Creates a grid
* Colors show strength of correlation
* Red = strong positive
* Blue = strong negative
* Numbers in each box show exact correlation values
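
If the grid gets crowded, one option is to pull out a single column of the matrix and sort it. This is a sketch on made-up data, and `target` is just a column name I picked for the example:

```python
import pandas as pd

# Made-up numbers, only for illustration
toy = pd.DataFrame({
    "target": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
    "d": [1, 3, 2, 3, 1],
})

corr = toy.corr(numeric_only=True)

# Correlation of every other column with "target", strongest first
# (sorted by absolute value so strong negatives are not buried)
with_target = corr["target"].drop("target")
order = with_target.abs().sort_values(ascending=False).index
ranked = with_target.reindex(order)
print(ranked.round(2))
```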


In [None]:
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap="YlGnBu")

## Ending...

Before this, I never knew how much EDA could reveal about data. From basic `.head()` to heatmaps, this journey taught me to "talk to the data" before modeling.

## Feedback Welcome

This notebook is my honest attempt to learn step-by-step, and I’ve shared it in case it helps someone else starting their journey too. I’d genuinely love to hear any recommendations, corrections, or suggestions for improvement.  
If you’re reading this and notice something I could do better — feel free to let me know. I’m here to learn and grow.

Thank youuuu!


## License

This notebook is shared for learning purposes.  
Please do not copy or re-upload it as your own.  
You're welcome to fork it or take inspiration, but give credit where it's due.  
© Vinaya Sangeeta Lahari Baswa, 2025.
