### Pandas – Loading and Cleaning Data
Today, we'll learn to use **Pandas**, the most essential tool in a Python data scientist's toolkit, to take a raw,
messy dataset and turn it into a clean, reliable foundation for analysis.


1.  **The Building Blocks:** What are Pandas `Series` and `DataFrame`?
2.  **Getting the Data:** Reading CSV files and first-look inspection.
3.  **The Cleaning Workflow:**
    *   Handling Missing Values (`NaN`)
    *   Finding and Removing Duplicates
    *   Correcting Data Types and Formatting
4.  **Exporting Our Clean Data.**

Pandas has two core data structures:

*   **`Series`**: A one-dimensional labeled array, like a single column in a spreadsheet.
*   **`DataFrame`**: A two-dimensional labeled data structure with columns of potentially different types, like a full spreadsheet or an SQL table.

Let's quickly create them to see how they work. First, we need to import pandas. The standard convention is to import it as `pd`.

In [1]:
import pandas as pd

In [8]:
ice_cream = ["Chocolate", "Vanilla", "Strawberry", "Rum Raisin","",""]
pd.Series(ice_cream)

0     Chocolate
1       Vanilla
2    Strawberry
3    Rum Raisin
4              
5              
dtype: object

In [9]:
student = {
    "name": "Pragya",
    "age": 21,
    "marks": {
        "math": 85,
        "science": 90,
        "english": 88
    }
}

In [10]:
pd.Series(student)

name                                         Pragya
age                                              21
marks    {'math': 85, 'science': 90, 'english': 88}
dtype: object

In [11]:
lottery_numbers = [4, 8, 15, 16, 23, 42]
pd.Series(lottery_numbers)

0     4
1     8
2    15
3    16
4    23
5    42
dtype: int64

In [12]:
# The second argument is the index: each flavor is labelled with a lottery number
pd.Series(ice_cream, index=lottery_numbers)

4      Chocolate
8        Vanilla
15    Strawberry
16    Rum Raisin
23              
42              
dtype: object

In [13]:
registration = [True, True, False, True, False]
pd.Series(registration)

0     True
1     True
2    False
3     True
4    False
dtype: bool

In [14]:
# A Series can have a custom index
student_scores = pd.Series(
    [85, 92, 78],
    index=['Alice', 'Bob', 'Charlie']
)
print("A Series :")
print(student_scores)

A Series :
Alice      85
Bob        92
Charlie    78
dtype: int64
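
Because a Series carries labels, a value can be fetched by label as well as by position. The following is a minimal sketch that reuses the scores from the cell above; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same values and labels as the cell above
scores = pd.Series([85, 92, 78], index=["Alice", "Bob", "Charlie"])

bob_score = scores["Bob"]          # look up by label
first_score = scores.iloc[0]       # look up by position
high_scores = scores[scores > 80]  # boolean filter; the labels are kept

print(bob_score)
print(first_score)
print(high_scores)
```

Using `.iloc` for positional access keeps the intent explicit, which matters once the index itself contains integers (as in the lottery-number example).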


### Creating a `DataFrame`

A `DataFrame` is the object you'll work with most often. Each of its columns is a `Series`, and all columns share one index. The most common way to create one from scratch is from a dictionary that maps column names to lists of values.

In [19]:

data = {
    'StudentID': [101, 102, 103, 104],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Score': [85, 92, 78, 88]
}

In [20]:
student_df = pd.DataFrame(data)
student_df

|   | StudentID | Name | Score |
|---|---|---|---|
| 0 | 101 | Alice | 85 |
| 1 | 102 | Bob | 92 |
| 2 | 103 | Charlie | 78 |
| 3 | 104 | David | 88 |


In [27]:
student_df = pd.DataFrame(data, index=data["StudentID"])
student_df

|   | StudentID | Name | Score |
|---|---|---|---|
| 101 | 101 | Alice | 85 |
| 102 | 102 | Bob | 92 |
| 103 | 103 | Charlie | 78 |
| 104 | 104 | David | 88 |
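
Passing `index=data["StudentID"]` leaves `StudentID` both as the index and as a regular column, as the output above shows. One alternative, sketched here under the assumption that the ID is only needed as a row label, is `set_index`, which moves the column into the index. The names `indexed_df`, `row_102`, and `score_col` are illustrative.

```python
import pandas as pd

data = {
    "StudentID": [101, 102, 103, 104],
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Score": [85, 92, 78, 88],
}

# set_index moves StudentID into the index instead of duplicating it
indexed_df = pd.DataFrame(data).set_index("StudentID")

row_102 = indexed_df.loc[102]     # one row by label, returned as a Series
score_col = indexed_df["Score"]   # one column, also a Series

print(indexed_df)
print(row_102)
```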


## 2. Reading Data & Initial Inspection

Manually creating DataFrames is rare. Most of the time, you'll load data from a file, most commonly a CSV (Comma-Separated Values) file.

We'll use the powerful `pd.read_csv()` function.
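
As a quick illustration before working with a file on disk: `read_csv` also accepts a file-like object, and its `skipinitialspace` option drops spaces that follow a delimiter. The two-row sample below is hypothetical and only meant as a sketch of that behavior.

```python
import io
import pandas as pd

# Hypothetical sample; note the space before "North" in the second row
sample_csv = """OrderID,Product,Region
1001,Laptop,North
1007,Laptop, North
"""

raw = pd.read_csv(io.StringIO(sample_csv))
trimmed = pd.read_csv(io.StringIO(sample_csv), skipinitialspace=True)

print(raw["Region"].tolist())      # the leading space is kept
print(trimmed["Region"].tolist())  # the leading space is dropped
```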

In [53]:
messy_data_csv = """OrderID,OrderDate,Product,Price,Quantity,Region
1001,2023-01-05,Laptop,$100,2,North
1002,2023-01-07,Mouse,$25.50,5,South
1003,2023-01-10,Keyboard,,3,North
1004,2023-01-12,Monitor,$300,,"West"
1005,2023-01-15,Webcam,$45.99,1,East
1002,2023-01-07,Mouse,$25.50,5,South
1006,2023-01-18,,$15.00,2,East
1007,2023-01-20,Laptop,$1200.00,1, North
1008,2023-01-22,External HDD,$80,4,USA
"""

with open('sales_data_messy.csv', 'w') as f:
    f.write(messy_data_csv)

In [55]:

df = pd.read_csv("sales_data_messy.csv")
df = df.set_index("OrderID")
df

| OrderID | OrderDate | Product | Price | Quantity | Region |
|---|---|---|---|---|---|
| 1001 | 2023-01-05 | Laptop | $100 | 2.0 | North |
| 1002 | 2023-01-07 | Mouse | $25.50 | 5.0 | South |
| 1003 | 2023-01-10 | Keyboard | NaN | 3.0 | North |
| 1004 | 2023-01-12 | Monitor | $300 | NaN | West |
| 1005 | 2023-01-15 | Webcam | $45.99 | 1.0 | East |
| 1002 | 2023-01-07 | Mouse | $25.50 | 5.0 | South |
| 1006 | 2023-01-18 | NaN | $15.00 | 2.0 | East |
| 1007 | 2023-01-20 | Laptop | $1200.00 | 1.0 | North |
| 1008 | 2023-01-22 | External HDD | $80 | 4.0 | USA |
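
In the table above, `Price` was read as text because of the `$` prefix, so it cannot be summed or averaged directly. One possible way to convert such values is sketched below on a few sample entries in the same shape; this is an assumption about how the cleanup could be done, not a step taken from the notebook.

```python
import pandas as pd

# A few Price values shaped like the table above, including one gap
prices = pd.Series(["$100", "$25.50", float("nan"), "$1200.00"], name="Price")

# Remove the literal "$" (no regex), then convert; the gap stays NaN
numeric_prices = pd.to_numeric(prices.str.replace("$", "", regex=False))

print(numeric_prices.dtype)
print(numeric_prices.tolist())
```

Passing `regex=False` matters here because `$` is a special character in regular expressions.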


In [68]:
# Same rows as before, but without the "$" in Price so pandas can parse it as a number
messy_data_csv = """OrderID,OrderDate,Product,Price,Quantity,Region
1001,2023-01-05,Laptop,100,2,North
1002,2023-01-07,Mouse,25.50,5,South
1003,2023-01-10,Keyboard,,3,North
1004,2023-01-12,Monitor,300,,"West"
1005,2023-01-15,Webcam,45.99,1,East
1002,2023-01-07,Mouse,25.50,5,South
1006,2023-01-18,,15.00,2,East
1007,2023-01-20,Laptop,1200.00,1, North
1008,2023-01-22,External HDD,80,4,USA
"""

with open('sales_data_messy.csv', 'w') as f:
    f.write(messy_data_csv)

df = pd.read_csv("sales_data_messy.csv")

In [69]:
df.head()

|   | OrderID | OrderDate | Product | Price | Quantity | Region |
|---|---|---|---|---|---|---|
| 0 | 1001 | 2023-01-05 | Laptop | 100.0 | 2.0 | North |
| 1 | 1002 | 2023-01-07 | Mouse | 25.5 | 5.0 | South |
| 2 | 1003 | 2023-01-10 | Keyboard | NaN | 3.0 | North |
| 3 | 1004 | 2023-01-12 | Monitor | 300.0 | NaN | West |
| 4 | 1005 | 2023-01-15 | Webcam | 45.99 | 1.0 | East |


In [70]:
df.tail()

|   | OrderID | OrderDate | Product | Price | Quantity | Region |
|---|---|---|---|---|---|---|
| 4 | 1005 | 2023-01-15 | Webcam | 45.99 | 1.0 | East |
| 5 | 1002 | 2023-01-07 | Mouse | 25.5 | 5.0 | South |
| 6 | 1006 | 2023-01-18 | NaN | 15.0 | 2.0 | East |
| 7 | 1007 | 2023-01-20 | Laptop | 1200.0 | 1.0 | North |
| 8 | 1008 | 2023-01-22 | External HDD | 80.0 | 4.0 | USA |


In [71]:
df.shape

(9, 6)

In [72]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   OrderID    9 non-null      int64  
 1   OrderDate  9 non-null      object 
 2   Product    8 non-null      object 
 3   Price      8 non-null      float64
 4   Quantity   8 non-null      float64
 5   Region     9 non-null      object 
dtypes: float64(2), int64(1), object(3)
memory usage: 564.0+ bytes
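
The `Non-Null Count` column above reports 8 of 9 values for `Product`, `Price`, and `Quantity`, so each has one gap. A sketch for counting and locating those gaps with `isna()` follows; it rebuilds the same rows from a string so that it runs on its own, and the variable names are illustrative.

```python
import io
import pandas as pd

csv_text = """OrderID,OrderDate,Product,Price,Quantity,Region
1001,2023-01-05,Laptop,100,2,North
1002,2023-01-07,Mouse,25.50,5,South
1003,2023-01-10,Keyboard,,3,North
1004,2023-01-12,Monitor,300,,"West"
1005,2023-01-15,Webcam,45.99,1,East
1002,2023-01-07,Mouse,25.50,5,South
1006,2023-01-18,,15.00,2,East
1007,2023-01-20,Laptop,1200.00,1, North
1008,2023-01-22,External HDD,80,4,USA
"""
df = pd.read_csv(io.StringIO(csv_text))

# Number of missing values in each column
missing_per_column = df.isna().sum()

# Rows that contain at least one missing value
rows_with_gaps = df[df.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_gaps)
```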


In [73]:
df.describe()

|   | OrderID | Price | Quantity |
|---|---|---|---|
| count | 9.0 | 8.0 | 8.0 |
| mean | 1004.222222 | 223.99875 | 2.875 |
| std | 2.438123 | 405.081484 | 1.642081 |
| min | 1001.0 | 15.0 | 1.0 |
| 25% | 1002.0 | 25.5 | 1.75 |
| 50% | 1004.0 | 62.995 | 2.5 |
| 75% | 1006.0 | 150.0 | 4.25 |
| max | 1008.0 | 1200.0 | 5.0 |
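
By default `describe()` summarizes every numeric column, which includes `OrderID` even though the mean of an identifier carries no meaning. A sketch of restricting the summary to chosen columns, and of describing text columns, is shown below on a small hypothetical frame.

```python
import pandas as pd

# Hypothetical three-row frame with the same kinds of columns
orders = pd.DataFrame({
    "OrderID": [1001, 1002, 1007],
    "Product": ["Laptop", "Mouse", "Laptop"],
    "Price": [100.0, 25.5, 1200.0],
    "Quantity": [2, 5, 1],
    "Region": ["North", "South", "North"],
})

# Summarize only the columns that are real measurements
numeric_summary = orders[["Price", "Quantity"]].describe()

# Text columns get count, unique, top, and freq instead of mean and std
text_summary = orders[["Product", "Region"]].describe()

print(numeric_summary)
print(text_summary)
```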


In [78]:

new_df = df[["Price", "Product"]]
new_df

|   | Price | Product |
|---|---|---|
| 0 | 100.0 | Laptop |
| 1 | 25.5 | Mouse |
| 2 | NaN | Keyboard |
| 3 | 300.0 | Monitor |
| 4 | 45.99 | Webcam |
| 5 | 25.5 | Mouse |
| 6 | 15.0 | NaN |
| 7 | 1200.0 | Laptop |
| 8 | 80.0 | External HDD |
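
The double brackets in the selection above pass a list of column names, which returns a DataFrame with the columns in the order requested. Single brackets with one name return a Series instead. A minimal sketch on a small hypothetical frame:

```python
import pandas as pd

# Hypothetical frame, used only to illustrate the two selection forms
orders = pd.DataFrame({
    "Product": ["Laptop", "Mouse", "Webcam"],
    "Price": [100.0, 25.5, 45.99],
    "Quantity": [2, 5, 1],
})

price_series = orders["Price"]            # single brackets: a Series
price_frame = orders[["Price"]]           # list of one name: a DataFrame
subset = orders[["Price", "Product"]]     # columns follow the list order

print(type(price_series).__name__)
print(type(price_frame).__name__)
print(list(subset.columns))
```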
