

### 📘 Creating DataFrames in Pandas

A **DataFrame** is the primary data structure in Pandas: a two-dimensional, labeled table of rows and columns, used extensively in data science for data manipulation and analysis.

#### 🧱 1. From Python Lists

DataFrames can be created using a list of lists where each inner list represents a row of data. Column names are explicitly specified for clarity.

#### 📚 2. From Dictionary of Lists

This is often the most readable format and is very commonly used. Each key becomes a column name, and the associated list holds that column's values, so all lists must have the same length.
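A minimal sketch of the dictionary-of-lists approach. The names and marks reuse the sample values from the list example in this notebook; the variable names `scores` and `df_scores` are illustrative assumptions.

```python
import pandas as pd

# Sample data: each key is a column name, each list is that column's values.
# All lists must have the same length.
scores = {
    "name": ["esha", "revati", "isha"],
    "marks": [0, 15, 10],
}

df_scores = pd.DataFrame(scores)
print(df_scores)

# Lists of different lengths cannot be aligned into rows, so Pandas raises
# a ValueError instead of guessing how to pad the shorter column.
try:
    pd.DataFrame({"a": [2, 3, 4, 5], "b": [23, 34, 45]})
except ValueError as exc:
    print("Cannot build DataFrame:", exc)
```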

#### 🔢 3. From NumPy Arrays

When working with numerical data, DataFrames can be created from NumPy arrays. Provide column names when doing so; otherwise Pandas labels the columns with integers (0, 1, ...).

#### 📂 4. From CSV Files

CSV files are widely used for storing tabular data. Pandas reads them directly with `pd.read_csv`. Useful options include `sep`, `header`, `names`, `index_col`, `usecols`, and `nrows`.
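The options above can be sketched as follows. To keep the example self-contained it reads from an in-memory string rather than a file; the data, the semicolon separator, and the column names `id`, `name`, and `marks` are all assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated text with no header row.
raw = "1;esha;0\n2;revati;15\n3;isha;10\n"

df_csv = pd.read_csv(
    io.StringIO(raw),               # a file path or URL would also work here
    sep=";",                        # field delimiter (default is ",")
    header=None,                    # the text has no header row
    names=["id", "name", "marks"],  # so supply the column names
    usecols=["id", "marks"],        # keep only these columns
    index_col="id",                 # use the id column as the row index
    nrows=2,                        # read only the first two data rows
)
print(df_csv)
```

Limiting `usecols` and `nrows` is a cheap way to preview a large file before loading it in full.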

#### 📊 5. From Excel Files

Excel files can also be read into Pandas with `pd.read_excel`. This requires an extra dependency: `openpyxl` for `.xlsx` files or `xlrd` for legacy `.xls` files.

#### 🌐 6. From JSON

JSON data can be read into a DataFrame from a file, a string, or a URL. Nested or structured sources usually need to be flattened into columns first.
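A small sketch of both cases, using made-up records. The JSON text is wrapped in `io.StringIO` because recent Pandas versions deprecate passing a literal JSON string to `pd.read_json`. The nested example uses `pd.json_normalize`, which is one way (not the only one) to flatten nested objects; the field names are hypothetical.

```python
import io

import pandas as pd

# Flat JSON: a list of records maps directly onto rows.
json_text = '[{"name": "esha", "marks": 0}, {"name": "revati", "marks": 15}]'
df_json = pd.read_json(io.StringIO(json_text))
print(df_json)

# Nested JSON: json_normalize flattens inner objects into dotted column names.
records = [
    {"name": "esha", "result": {"marks": 0, "passed": False}},
    {"name": "revati", "result": {"marks": 15, "passed": True}},
]
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```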

#### 🗄️ 7. From SQL Databases

Pandas can interface with SQL databases to extract data using SQL queries, making it convenient to work with large datasets stored in relational databases.
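As a rough illustration, the sketch below uses Python's built-in `sqlite3` module with an in-memory database, so nothing needs to be installed or configured. The `students` table and its contents are assumptions; a real project would connect to its own database instead.

```python
import sqlite3

import pandas as pd

# Build a throwaway in-memory database with a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, marks INTEGER)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("esha", 0), ("revati", 15), ("isha", 10)],
)
conn.commit()

# Run a parameterized query and load the result set into a DataFrame.
query = "SELECT name, marks FROM students WHERE marks >= ? ORDER BY marks"
df_sql = pd.read_sql_query(query, conn, params=(10,))
conn.close()

print(df_sql)
```

Filtering in the SQL query means only the matching rows are transferred into memory, which matters for large tables.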

#### 🌍 8. From the Web

Data can be loaded directly from web URLs, such as CSV files hosted online. This is useful for working with public datasets without downloading them first.

---

### 🔍 Exploratory Data Analysis (EDA)

EDA is the process of examining a dataset to summarize its main characteristics. It helps uncover insights, detect anomalies, and understand relationships between variables before applying models.

Key EDA steps:

* Generating summary statistics
* Checking data types and missing values
* Identifying duplicates and outliers
* Creating visualizations like histograms, box plots, and scatter plots
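The first three steps can be sketched on a small made-up table. The values are illustrative only, and the 1.5 × IQR rule used for outliers is one common convention rather than the only option. Plotting is omitted to keep the example dependency-free.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value, one duplicate row, one outlier.
df_eda = pd.DataFrame({
    "age": [52, 53, 70, 61, 61, np.nan, 150],
    "chol": [125, 203, 174, 203, 203, 294, 204],
})

# Summary statistics for numeric columns.
print(df_eda.describe())

# Data types and missing values per column.
print(df_eda.dtypes)
missing_counts = df_eda.isna().sum()

# Fully duplicated rows (the first occurrence is not counted).
n_duplicates = df_eda.duplicated().sum()

# Outliers in "age" using the 1.5 * IQR rule.
q1 = df_eda["age"].quantile(0.25)
q3 = df_eda["age"].quantile(0.75)
iqr = q3 - q1
mask = (df_eda["age"] < q1 - 1.5 * iqr) | (df_eda["age"] > q3 + 1.5 * iqr)
outliers = df_eda[mask]

print(missing_counts, n_duplicates, outliers, sep="\n")
```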

---

### 🧭 Essential EDA Commands (Theory)

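* **View Structure**: `df.info()` and `df.shape` show the number of rows and columns, the data types, and the non-null counts
* **Summary Stats**: `df.describe()` reports count, mean, standard deviation, min, quartiles, and max for numeric columns
* **Column Overview**: `df.columns` lists all column names
* **Quick Look**: `df.head()` and `df.tail()` show the first and last five rows for a quick sense of data quality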

---

### ✅ Summary

* DataFrames can be created from various sources: lists, dictionaries, arrays, files, databases, or the web.
* EDA helps you clean, understand, and prepare your data effectively before analysis or modeling.



In [1]:
import pandas as pd 

In [5]:
data = [["esha", 0],["revati", 15],["isha", 10]]

In [6]:
data

[['esha', 0], ['revati', 15], ['isha', 10]]

In [39]:
pd.DataFrame(data, columns=["name", "Marks"])

,name,Marks


In [11]:
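# Note: the lists differ in length (4 vs. 3), so pd.DataFrame(data) would raise
# a ValueError; use equal-length lists to build a DataFrame from this dict.
data = {"a" :[2,3,4,5], "b":[23,34,45]}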

In [12]:
data

{'a': [2, 3, 4, 5], 'b': [23, 34, 45]}

In [16]:
import numpy as np

In [17]:
arr = np.array([[1,2],[5,6]])

In [19]:
df = pd.DataFrame(arr, columns=["a", "b"])

In [20]:
df

,a,b
0,1,2
1,5,6


In [27]:
pd.read_excel("Book1.xlsx")

,Name,School,Marks
0,esha,WWW,12
1,mudabbir,UT,23
2,jack,ST,33
3,rishi,DPS,44
4,aakash,BBL,55


In [29]:
df = pd.read_csv("train_outliers_preprocessed.csv")

In [30]:
df

,age,sex,gender,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52.0,Female,Female,120.0,125.0,0.0,0.0,152.0,0.0,0.0,1.0,2.0,2.0,1.0,0.0
1,53.0,0,Female,140.0,203.0,0.0,0.0,155.0,0.0,3.1,0.0,0.0,3.0,0.0,0.0
2,70.0,0,Female,145.0,174.0,0.0,1.0,125.0,0.0,2.6,0.0,0.0,3.0,0.0,0.0
3,61.0,0,Female,148.0,203.0,0.0,1.0,161.0,0.0,0.0,2.0,1.0,3.0,0.0,0.0
4,62.0,0,Female,138.0,294.0,0.0,1.0,152.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1020,59.0,1,Male,140.0,204.0,0.0,1.0,152.0,0.0,0.0,2.0,0.0,2.0,1.0,1.0
1021,60.0,0,Female,120.0,204.0,0.0,1.0,152.0,0.0,2.8,1.0,1.0,3.0,0.0,0.0
1022,47.0,0,Female,120.0,204.0,0.0,0.0,118.0,0.0,1.0,1.0,1.0,2.0,0.0,0.0
1023,50.0,0,Female,120.0,204.0,0.0,0.0,159.0,0.0,0.0,2.0,0.0,2.0,1.0,1.0


In [31]:
df.tail()

,age,sex,gender,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
1020,59.0,1,Male,140.0,204.0,0.0,1.0,152.0,0.0,0.0,2.0,0.0,2.0,1.0,1.0
1021,60.0,0,Female,120.0,204.0,0.0,1.0,152.0,0.0,2.8,1.0,1.0,3.0,0.0,0.0
1022,47.0,0,Female,120.0,204.0,0.0,0.0,118.0,0.0,1.0,1.0,1.0,2.0,0.0,0.0
1023,50.0,0,Female,120.0,204.0,0.0,0.0,159.0,0.0,0.0,2.0,0.0,2.0,1.0,1.0
1024,54.0,0,Female,120.0,204.0,0.0,1.0,113.0,0.0,1.4,1.0,1.0,3.0,0.0,0.0


In [32]:
df.head()

,age,sex,gender,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52.0,Female,Female,120.0,125.0,0.0,0.0,152.0,0.0,0.0,1.0,2.0,2.0,1.0,0.0
1,53.0,0,Female,140.0,203.0,0.0,0.0,155.0,0.0,3.1,0.0,0.0,3.0,0.0,0.0
2,70.0,0,Female,145.0,174.0,0.0,1.0,125.0,0.0,2.6,0.0,0.0,3.0,0.0,0.0
3,61.0,0,Female,148.0,203.0,0.0,1.0,161.0,0.0,0.0,2.0,1.0,3.0,0.0,0.0
4,62.0,0,Female,138.0,294.0,0.0,1.0,152.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0


In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 15 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1025 non-null   float64
 1   sex       1025 non-null   object 
 2   gender    1025 non-null   object 
 3   cp        1025 non-null   float64
 4   trestbps  1025 non-null   float64
 5   chol      1025 non-null   float64
 6   fbs       1025 non-null   float64
 7   restecg   1025 non-null   float64
 8   thalach   1025 non-null   float64
 9   exang     1025 non-null   float64
 10  oldpeak   1025 non-null   float64
 11  slope     1025 non-null   float64
 12  ca        1025 non-null   float64
 13  thal      1025 non-null   float64
 14  target    1025 non-null   float64
dtypes: float64(13), object(2)
memory usage: 120.2+ KB


In [34]:
df.describe()

,age,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0
mean,55.155122,124.926829,232.719024,0.0,0.661463,152.381463,0.0,0.799317,1.320976,0.43122,2.292683,0.613659,0.381463
std,7.888278,10.911823,41.884923,0.0,0.49758,15.96633,0.0,0.985049,0.587466,0.692736,0.564402,0.487148,0.485983
min,35.0,100.0,125.0,0.0,0.0,112.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
25%,51.0,120.0,204.0,0.0,0.0,145.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0
50%,58.0,120.0,220.0,0.0,1.0,152.0,0.0,0.4,1.0,0.0,2.0,1.0,0.0
75%,59.0,130.0,261.0,0.0,1.0,162.0,0.0,1.4,2.0,1.0,3.0,1.0,1.0
max,74.0,156.0,354.0,0.0,2.0,192.0,0.0,3.6,2.0,2.0,3.0,1.0,1.0


In [35]:
df.columns

Index(['age', 'sex', 'gender', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
       'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

In [36]:
df.shape

(1025, 15)

In [37]:
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv"
df = pd.read_csv(url)

In [None]:
# df = pd.read_json("data.json")