# Pandas

Pandas is an open-source data manipulation and analysis library for Python. It provides data structures and functions for cleaning, transforming, and analyzing structured data. Pandas is particularly well suited to tabular data, such as data stored in spreadsheets or databases.

### Key Features:
1. **Data Structures**:
   - **Series**: A one-dimensional labeled array capable of holding any data type.
   - **DataFrame**: A two-dimensional labeled data structure with columns of potentially different types, similar to a table in a database or a data frame in R.

2. **Data Manipulation**:
   - **Indexing and Selection**: Access and modify data using labels and positions.
   - **Merging and Joining**: Combine multiple DataFrames using SQL-like joins.
   - **Grouping**: Group data for aggregation and transformation.
   - **Reshaping**: Pivot, stack, and unstack data to change its layout.
   - **Handling Missing Data**: Detect, remove, and fill missing values.

3. **Data Analysis**:
   - **Descriptive Statistics**: Calculate summary statistics like mean, median, and standard deviation.
   - **Time Series**: Handle and manipulate time series data.
   - **Visualization**: Integrate with libraries like Matplotlib for data visualization.
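
The notebook cells that follow do not exercise merging, grouping, or missing-data handling, so here is a minimal sketch of those three features. The `employees` and `departments` tables, their column names, and their values are all invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for this sketch
employees = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Dept": ["Eng", "Eng", "Ops", "Ops"],
    "Salary": [70000.0, np.nan, 90000.0, 60000.0],
})
departments = pd.DataFrame({
    "Dept": ["Eng", "Ops"],
    "Floor": [3, 1],
})

# Handling missing data: fill the missing salary with the column mean
# (mean() skips NaN values by default)
mean_salary = employees["Salary"].mean()
employees["Salary"] = employees["Salary"].fillna(mean_salary)

# Merging: SQL-like left join on the shared "Dept" column
merged = employees.merge(departments, on="Dept", how="left")

# Grouping: average salary per department
avg_by_dept = merged.groupby("Dept")["Salary"].mean()

print(merged)
print(avg_by_dept)
```

Filling with the mean is only one possible strategy; `dropna()` or a fixed fill value may suit other datasets better.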


In [1]:
import pandas as pd

In [2]:
# Create a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)
print()

# Select a column
print(df['Name'])
print()

# Filter rows based on a condition
print(df[df['Age'] > 30])
print()

# Add a new column
df['Salary'] = [70000, 80000, 90000]
print(df)
print()

# Calculate summary statistics
print(df.describe())

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object

      Name  Age     City
2  Charlie   35  Chicago

      Name  Age         City  Salary
0    Alice   25     New York   70000
1      Bob   30  Los Angeles   80000
2  Charlie   35      Chicago   90000

        Age   Salary
count   3.0      3.0
mean   30.0  80000.0
std     5.0  10000.0
min    25.0  70000.0
25%    27.5  75000.0
50%    30.0  80000.0
75%    32.5  85000.0
max    35.0  90000.0


In [3]:
import numpy as np

In [4]:
labels = ['A', 'B', 'C', 'D']
list1 = [1, 2, 3, 4]
list2 = np.array([40, 50, 60])
dictionary = {
    'A': 10,
    'B': 20,
    'C': 30,
}

In [5]:
pd.Series(data=list1)

0    1
1    2
2    3
3    4
dtype: int64

In [6]:
pd.Series(data=list1, index=labels)

A    1
B    2
C    3
D    4
dtype: int64

In [7]:
pd.Series(data=labels)

0    A
1    B
2    C
3    D
dtype: object

In [8]:
pd.Series([sum, print, len])

0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

In [9]:
pd.Series(list2)

0    40
1    50
2    60
dtype: int32

In [10]:
pd.Series(dictionary)

A    10
B    20
C    30
dtype: int64

In [11]:
my_dataset = {
    "Cars": ["BMW", "Bugatti", "Lamborghini"],
    "Color": ["Red", "Blue", "Green"],
}

pd.Series(my_dataset)

Cars     [BMW, Bugatti, Lamborghini]
Color             [Red, Blue, Green]
dtype: object

In [12]:
pd.DataFrame(my_dataset)

          Cars  Color
0          BMW    Red
1      Bugatti   Blue
2  Lamborghini  Green


In [13]:
fruits = ["Apple", "Orange", "Banana", "Grapes"]
colors = ["Red", "Orange", "Yellow", "Green"]

print(pd.Series(fruits, colors), "\n")
print(pd.DataFrame({"Fruits": fruits, "Colors": colors}))

Red        Apple
Orange    Orange
Yellow    Banana
Green     Grapes
dtype: object 

   Fruits  Colors
0   Apple     Red
1  Orange  Orange
2  Banana  Yellow
3  Grapes   Green


In [14]:
pd.Series([1, 2, 3, 4], ['a', 'b', 'c', 'd'])

a    1
b    2
c    3
d    4
dtype: int64

In [15]:
import random

In [16]:
# Seeding initializes the random generator's state, so the same seed reproduces the same sequence
random.seed(40)
print("First:", random.randint(50, 100)) # Prints 79

random.seed(40)
print("Second: ", random.randint(50, 100)) # Also prints 79

First: 79
Second:  79


In [17]:
# The seed call below is commented out, so this value depends on the generator's current state
# random.seed(10)
print(random.random())

random.seed(10)
print(random.random())

0.5794876581271581
0.5714025946899135


In [18]:
# randn draws from the standard normal distribution (mean 0, std 1), so values can be negative
# np.random.seed(5)
pd.DataFrame(np.random.randn(5, 4), index="A B C D E".split(), columns="P Q R S".split())

          P         Q         R         S
A -0.880216 -1.124792 -2.237082  0.152265
B  1.329084  0.095936 -0.873125 -0.004997
C  0.063208  0.599832 -0.688645  1.111726
D -0.233116 -2.032303 -0.302994  1.062598
E  0.479928  0.693184 -0.733692 -0.420060
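
The commented-out `np.random.seed(5)` above hints at how to make this table reproducible. A minimal sketch, assuming the legacy global NumPy seed is acceptable for the use case (the names `df_a` and `df_b` are illustrative):

```python
import numpy as np
import pandas as pd

rows = "A B C D E".split()
cols = "P Q R S".split()

# Seeding before each draw resets the generator to the same state
np.random.seed(5)
df_a = pd.DataFrame(np.random.randn(5, 4), index=rows, columns=cols)

np.random.seed(5)
df_b = pd.DataFrame(np.random.randn(5, 4), index=rows, columns=cols)

# Both frames hold identical values because the seed was the same
print(df_a.equals(df_b))  # True
```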


In [19]:
# rand draws uniformly from [0, 1), so values are never negative
df = pd.DataFrame(np.random.rand(5, 4), index="A B C D E".split(), columns="P Q R S".split())
print(df)

print(df["P"]) # Selects the column labeled P (returned as a Series)

print(df[["R", "S"]]) # Selects columns R and S (returned as a DataFrame)

          P         Q         R         S
A  0.369498  0.326707  0.268009  0.574409
B  0.426284  0.365147  0.825715  0.344388
C  0.483149  0.347895  0.958949  0.451255
D  0.499380  0.082573  0.403521  0.200722
E  0.764149  0.935239  0.978174  0.101058
A    0.369498
B    0.426284
C    0.483149
D    0.499380
E    0.764149
Name: P, dtype: float64
          R         S
A  0.268009  0.574409
B  0.825715  0.344388
C  0.958949  0.451255
D  0.403521  0.200722
E  0.978174  0.101058


In [20]:
print(type(df["P"]))
print(type(df[["P", "Q"]]))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>


In [21]:
# Add new column
df["T"] = [1, 2, 3, 4, 5]
df

          P         Q         R         S  T
A  0.369498  0.326707  0.268009  0.574409  1
B  0.426284  0.365147  0.825715  0.344388  2
C  0.483149  0.347895  0.958949  0.451255  3
D  0.499380  0.082573  0.403521  0.200722  4
E  0.764149  0.935239  0.978174  0.101058  5


In [22]:
# drop returns a new DataFrame with the row or column removed; the original df is unchanged
print(df.drop("T", axis=1))
print(df.drop("D", axis=0))
df

          P         Q         R         S
A  0.369498  0.326707  0.268009  0.574409
B  0.426284  0.365147  0.825715  0.344388
C  0.483149  0.347895  0.958949  0.451255
D  0.499380  0.082573  0.403521  0.200722
E  0.764149  0.935239  0.978174  0.101058
          P         Q         R         S  T
A  0.369498  0.326707  0.268009  0.574409  1
B  0.426284  0.365147  0.825715  0.344388  2
C  0.483149  0.347895  0.958949  0.451255  3
E  0.764149  0.935239  0.978174  0.101058  5


          P         Q         R         S  T
A  0.369498  0.326707  0.268009  0.574409  1
B  0.426284  0.365147  0.825715  0.344388  2
C  0.483149  0.347895  0.958949  0.451255  3
D  0.499380  0.082573  0.403521  0.200722  4
E  0.764149  0.935239  0.978174  0.101058  5
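
Because `drop` leaves the original untouched, the result has to be assigned to a name to keep the change. A small sketch with a made-up frame (`df_demo` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical frame, invented for this sketch
df_demo = pd.DataFrame(
    {"P": [1, 2, 3], "T": [10, 20, 30]},
    index=["A", "B", "C"],
)

# drop returns a new DataFrame; df_demo still has column T afterwards
without_t = df_demo.drop("T", axis=1)
print(list(df_demo.columns))    # ['P', 'T']
print(list(without_t.columns))  # ['P']

# Reassigning the result makes the removal stick
df_demo = df_demo.drop("B", axis=0)
print(list(df_demo.index))      # ['A', 'C']
```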


In [23]:
# Locate a row by the label
df.loc['A']

P    0.369498
Q    0.326707
R    0.268009
S    0.574409
T    1.000000
Name: A, dtype: float64

In [24]:
# Locate a row by its integer position (0-based)
df.iloc[2]

P    0.483149
Q    0.347895
R    0.958949
S    0.451255
T    3.000000
Name: C, dtype: float64

In [25]:
# Locate a cell by row label and column label
df.loc["B", "R"]

0.8257146944547653

In [26]:
# Locate a part of the dataframe
df.loc[["A", "B"], ["R", "S"]]

          R         S
A  0.268009  0.574409
B  0.825715  0.344388
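
The difference between `loc` (labels) and `iloc` (positions) is easiest to see when the labels are themselves integers, or when slicing. A short sketch with invented Series `s` and `t`:

```python
import pandas as pd

# Integer labels deliberately out of order, so label != position
s = pd.Series(["x", "y", "z"], index=[2, 0, 1])

print(s.loc[0])   # label 0    -> 'y'
print(s.iloc[0])  # position 0 -> 'x'

# Slicing: loc includes the end label, iloc excludes the end position
t = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(len(t.loc["a":"b"]))  # 2
print(len(t.iloc[0:1]))     # 1
```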


In [27]:
# Zip: Combine 2 lists
a1 = ['A', 'A', 'A', 'B', 'B', 'B']
b1 = [1, 2, 3, 1, 2, 3]

p = list(zip(a1, b1))
p

[('A', 1), ('A', 2), ('A', 3), ('B', 1), ('B', 2), ('B', 3)]

In [28]:
df = pd.DataFrame(np.random.randn(6, 2), index=p, columns=['X', 'Y'])
df

               X         Y
(A, 1) -0.277855  0.560018
(A, 2)  0.536641  0.367636
(A, 3)  0.903835  1.198604
(B, 1) -2.248095  1.667915
(B, 2) -1.389806  0.314747
(B, 3)  1.065496  0.061149
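
The cell above uses the tuples themselves as row labels. If hierarchical selection is the goal, one alternative is to build an explicit `MultiIndex` from the same tuples. This is a sketch of that alternative, not what the cell above does; the level names `Group` and `Num` are invented.

```python
import numpy as np
import pandas as pd

a1 = ['A', 'A', 'A', 'B', 'B', 'B']
b1 = [1, 2, 3, 1, 2, 3]
pairs = list(zip(a1, b1))

# Build a two-level index; "Group" and "Num" are hypothetical level names
multi = pd.MultiIndex.from_tuples(pairs, names=["Group", "Num"])
df_multi = pd.DataFrame(np.random.randn(6, 2), index=multi, columns=["X", "Y"])

# All rows under outer label A, indexed by the inner level
print(df_multi.loc["A"])

# A single row, addressed by the full (outer, inner) key
print(df_multi.loc[("B", 2), :])
```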


#### Write a Python program to convert a list to a Series.

In [29]:
import pandas as pd

In [30]:
l = [2, 3, 4, 5]
series = pd.Series(l)
series

0    2
1    3
2    4
3    5
dtype: int64

#### Write a Python program to generate a series of dates from 1 August 2024 to 24 August 2024 (inclusive).

In [31]:
import datetime

In [36]:
start = datetime.datetime.strptime("01-08-2024", "%d-%m-%Y")
end = datetime.datetime.strptime("24-08-2024", "%d-%m-%Y")

dates = pd.date_range(start, end)
dates

DatetimeIndex(['2024-08-01', '2024-08-02', '2024-08-03', '2024-08-04',
               '2024-08-05', '2024-08-06', '2024-08-07', '2024-08-08',
               '2024-08-09', '2024-08-10', '2024-08-11', '2024-08-12',
               '2024-08-13', '2024-08-14', '2024-08-15', '2024-08-16',
               '2024-08-17', '2024-08-18', '2024-08-19', '2024-08-20',
               '2024-08-21', '2024-08-22', '2024-08-23', '2024-08-24'],
              dtype='datetime64[ns]', freq='D')
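
`pd.date_range` also accepts date strings directly, so the `strptime` step is optional. The result is a `DatetimeIndex`; wrapping it in `pd.Series` gives an actual Series of dates. A minimal sketch (the name `date_series` is illustrative):

```python
import pandas as pd

# ISO-formatted strings are parsed automatically; both endpoints are included
dates = pd.date_range(start="2024-08-01", end="2024-08-24", freq="D")

# Wrap the DatetimeIndex to get a Series with a default integer index
date_series = pd.Series(dates)

print(len(date_series))      # 24
print(date_series.iloc[0])   # 2024-08-01 00:00:00
print(date_series.iloc[-1])  # 2024-08-24 00:00:00
```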

#### Convert a dictionary into a DataFrame and display it.

In [33]:
d = {
    "1": "A",
    "2": "B",
    "3": "C",
    "4": "D",
    "5": "E",
}

df = pd.DataFrame(d.items())
print(df)

   0  1
0  1  A
1  2  B
2  3  C
3  4  D
4  5  E
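
The default column labels `0` and `1` say little about the data. One option is to pass `columns=` when building the frame. In this sketch, `Key` and `Value` are names chosen for illustration:

```python
import pandas as pd

d = {"1": "A", "2": "B", "3": "C", "4": "D", "5": "E"}

# Each (key, value) pair becomes one row; the column names are hypothetical
df_named = pd.DataFrame(list(d.items()), columns=["Key", "Value"])
print(df_named)
```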


#### Create a 2D list, convert it into a DataFrame, and display it.

In [34]:
l = [
    [1, 2],
    [3, 4],
    [5, 6],
]

df = pd.DataFrame(l)
df

   0  1
0  1  2
1  3  4
2  5  6
