# Pandas

Pandas is a popular open-source data manipulation and analysis library for Python. It provides data structures for efficiently storing and manipulating large datasets. Here are some important points about Pandas:

#### Data Structures:

Pandas introduces two primary data structures: Series and DataFrame.
A Series is a one-dimensional array with labeled indices, similar to a column in a spreadsheet.
A DataFrame is a two-dimensional table with labeled axes (rows and columns).
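
To make the two structures concrete, here is a minimal sketch. The fruit data and the names `prices` and `fruit` are made up for illustration only.

```python
import pandas as pd

# A Series: one-dimensional values, each with an index label.
prices = pd.Series([1.5, 2.0, 3.25], index=["apple", "banana", "cherry"], name="price")

# A DataFrame: a table whose columns are Series sharing one row index.
fruit = pd.DataFrame(
    {"price": [1.5, 2.0, 3.25], "stock": [10, 0, 7]},
    index=["apple", "banana", "cherry"],
)

print(prices["banana"])      # look up a value by its label
print(fruit.shape)           # (number of rows, number of columns)
print(type(fruit["price"]))  # selecting one column gives back a Series
```
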

#### Data Input/Output:

Pandas supports various file formats, including CSV, Excel, SQL, HDF5, and more, making it easy to read and write data.
Common functions include pd.read_csv() and pd.read_excel() for reading, and the DataFrame methods to_csv() and to_excel() for writing (for example df.to_csv(); there is no pd.to_csv()).
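
A small round trip can illustrate the split between top-level readers and DataFrame writer methods. This sketch reads from an in-memory string instead of a real file; the CSV text and variable names are assumptions for the example.

```python
import io

import pandas as pd

csv_text = "name,score\nana,90\nben,85\n"

# pd.read_csv is a top-level function; it accepts a path or a file-like object.
scores = pd.read_csv(io.StringIO(csv_text))

# to_csv is a DataFrame method; when no path is given it returns the CSV text.
round_trip = scores.to_csv(index=False)
print(round_trip)
```
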

#### Indexing and Selection:

Pandas uses labels to index and select data. You can use labels, slices, boolean indexing, or positional indexing.
.loc[] is primarily label-based indexing, while .iloc[] is integer-location based indexing.
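
The difference between label-based and position-based selection is easiest to see side by side. The following is a sketch on a made-up frame `df` with string row labels.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [10, 20, 30, 40], "b": [1.0, 2.0, 3.0, 4.0]},
    index=["w", "x", "y", "z"],
)

by_label = df.loc["x":"y", "a"]  # label slice: both endpoints are included
by_position = df.iloc[1:3, 0]    # positional slice: the stop position is excluded
filtered = df[df["a"] > 20]      # boolean mask: keeps rows where the condition holds

print(by_label.tolist(), by_position.tolist(), filtered.index.tolist())
```

Note that the label slice and the positional slice select the same two rows here, even though their endpoints are written differently.
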

#### Handling Missing Data:

Pandas provides methods for handling missing data, such as dropna() to remove missing values, fillna() to replace them with a given value, and interpolate() to estimate them from neighboring values.
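
As a quick illustration of these options, the sketch below applies each one to a small made-up Series containing two missing values.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

dropped = s.dropna()            # removes the two missing entries
filled = s.fillna(0.0)          # replaces each missing entry with a constant
interpolated = s.interpolate()  # linear interpolation between neighbors

print(dropped.tolist())
print(filled.tolist())
print(interpolated.tolist())
```
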

#### Data Cleaning and Transformation:

Pandas supports various data cleaning and transformation operations, such as merging/joining, grouping, pivoting, and reshaping data.
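
Reshaping is probably the least intuitive of these operations, so here is a hedged sketch of going from long to wide format and back. The sales figures and column names are invented for the example.

```python
import pandas as pd

long_df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Rome", "Rome"],
    "year": [2022, 2023, 2022, 2023],
    "sales": [5, 7, 3, 4],
})

# Long -> wide: one row per city, one column per year.
wide = long_df.pivot(index="city", columns="year", values="sales")

# Wide -> long again: the year columns are folded back into a single column.
back = wide.reset_index().melt(id_vars="city", var_name="year", value_name="sales")

print(wide)
print(back)
```
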

#### Statistical and Mathematical Operations:

Pandas enables statistical and mathematical operations on data, including mean, median, sum, standard deviation, and more.
Methods like describe() provide summary statistics for numerical columns.
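
A short sketch of the basic reductions on a made-up Series follows. One detail worth knowing is that std() uses the sample standard deviation (ddof=1) by default; passing ddof=0 gives the population value instead.

```python
import pandas as pd

s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

print(s.sum(), s.mean(), s.median())
print(s.std())        # sample standard deviation (ddof=1, the default)
print(s.std(ddof=0))  # population standard deviation
```
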

#### Time Series Data:

Pandas has robust support for time series data with features for date range generation, date shifting, and frequency conversion.
The resample() and asfreq() functions are commonly used for time-based operations.
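
The sketch below contrasts resample(), which aggregates values into time bins, with asfreq(), which only picks the values at the new frequency. The daily series is made up for illustration.

```python
import pandas as pd

idx = pd.date_range("2023-10-14", periods=6, freq="D")
daily = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

two_day_sum = daily.resample("2D").sum()  # aggregate into 2-day bins
every_other = daily.asfreq("2D")          # keep only the value at each 2-day step
shifted = daily.shift(1)                  # move values one step later; first becomes NaN

print(two_day_sum.tolist())
print(every_other.tolist())
```
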

#### Plotting:

Pandas integrates with Matplotlib for easy plotting of data using the plot() method on Series and DataFrame objects.
Common plot types include line plots, bar plots, scatter plots, and histograms.

#### GroupBy Operations:

The groupby() function allows grouping data based on some criteria and applying a function to each group independently.
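
A minimal split-apply-combine sketch, using a made-up table of team scores:

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "b", "a", "b", "a"],
    "points": [1, 2, 3, 4, 5],
})

# One aggregate per group.
totals = df.groupby("team")["points"].sum()

# Several aggregates at once; the result has one column per function.
summary = df.groupby("team")["points"].agg(["mean", "max"])

print(totals)
print(summary)
```
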

#### Merging and Concatenating:

Pandas provides functions like concat() and merge() to combine DataFrames either along rows or columns.
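
The following sketch shows the three most common cases on two small made-up frames: stacking rows, placing frames side by side, and joining on a key column.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "l": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "r": ["x", "y", "z"]})

stacked = pd.concat([left, left], ignore_index=True)   # along rows
side_by_side = pd.concat([left, right], axis=1)        # along columns, aligned on index
joined = pd.merge(left, right, on="key", how="inner")  # keep keys present in both

print(stacked.shape, side_by_side.shape)
print(joined)
```
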
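
#### Performance Optimization:

Efficient data handling is crucial for large datasets. Vectorized operations on whole columns are usually the fastest option; apply() and map() are more flexible but typically slower, because they call a Python function for each element, row, or column.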
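
To show what "vectorized" means in practice, this sketch computes the same column sum twice, once row by row with apply() and once as a single column expression. No timings are claimed here; the point is only that both give identical results, so the vectorized form can usually replace the row-wise one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(1000), "b": np.arange(1000) * 2})

# Row-wise: the lambda is called once per row.
row_wise = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Vectorized: one operation on whole columns.
vectorized = df["a"] + df["b"]

print((row_wise == vectorized).all())
```
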

#### Categorical Data:

Pandas supports categorical data types, which can be useful for memory and performance optimization when dealing with large datasets.
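
The memory effect can be checked directly. This sketch uses a made-up column with only three distinct strings repeated many times, which is the situation where a categorical dtype tends to help most.

```python
import pandas as pd

colors = pd.Series(["red", "green", "blue"] * 10_000)
as_category = colors.astype("category")

# deep=True counts the memory of the string values themselves.
plain_bytes = colors.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(plain_bytes, category_bytes)
print(list(as_category.cat.categories))
```
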

#### MultiIndexing:

Pandas allows creating DataFrames with MultiIndex, enabling hierarchical indexing for more complex data structures.
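
A minimal sketch of hierarchical indexing follows, with a made-up two-level index of year and quarter.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["2022", "2023"], ["Q1", "Q2"]], names=["year", "quarter"]
)
sales = pd.DataFrame({"sales": [10, 12, 14, 18]}, index=index)

one_year = sales.loc["2023"]                   # all quarters of one year
one_cell = sales.loc[("2023", "Q2"), "sales"]  # a single value via a full key tuple
by_quarter = sales.xs("Q1", level="quarter")   # cross-section on the inner level

print(one_year)
print(one_cell)
print(by_quarter)
```
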

In [1]:
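import numpy as np
import pandas as pd

# Each inner sequence becomes one ROW of the DataFrame: dates, negative integers,
# positive integers, random lowercase letters, and random uppercase letters.
# ISO dates (YYYY-MM-DD) avoid day-first/month-first ambiguity when parsing.
x = pd.DataFrame([pd.date_range('2023-10-14', '2023-10-23'),
                  np.random.randint(-100, -1, 10),
                  np.random.randint(1, 100, 10),
                  [chr(ele) for ele in np.random.randint(97, 123, 10)],
                  [chr(ele) for ele in np.random.randint(65, 91, 10)]])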

In [2]:
x

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,2023-10-14 00:00:00,2023-10-15 00:00:00,2023-10-16 00:00:00,2023-10-17 00:00:00,2023-10-18 00:00:00,2023-10-19 00:00:00,2023-10-20 00:00:00,2023-10-21 00:00:00,2023-10-22 00:00:00,2023-10-23 00:00:00
1,-32,-14,-82,-37,-4,-64,-95,-65,-34,-64
2,83,30,60,11,26,2,87,62,49,23
3,q,z,o,h,l,l,e,r,n,i
4,T,P,C,A,W,W,U,B,Q,W


x.copy(): This creates a copy of the DataFrame x as a new object in memory, so that later modifications to the new DataFrame do not affect the original one.

.T: This transposes the copy. In other words, it swaps rows and columns, so each of the five original rows becomes a column.

In [3]:
x=x.copy().T
x

Unnamed: 0,0,1,2,3,4
0,2023-10-14,-32,83,q,T
1,2023-10-15,-14,30,z,P
2,2023-10-16,-82,60,o,C
3,2023-10-17,-37,11,h,A
4,2023-10-18,-4,26,l,W
5,2023-10-19,-64,2,l,W
6,2023-10-20,-95,87,e,U
7,2023-10-21,-65,62,r,B
8,2023-10-22,-34,49,n,Q
9,2023-10-23,-64,23,i,W


x.columns = ['col_'+str(ele) for ele in x.columns]: This line is renaming the columns of the DataFrame x. It creates a new list of column names where each original column name is prefixed with the string 'col_'. The list comprehension (['col_'+str(ele) for ele in x.columns]) is used to iterate over the original column names and create a new list of modified names.

x.index = ['row_'+str(ele) for ele in x.index]: This line is renaming the index (row labels) of the DataFrame x. Similar to the columns, it creates a new list of index labels where each original index label is prefixed with the string 'row_'.

In [4]:
x.columns = ['col_'+str(ele) for ele in x.columns]
x.index = ['row_'+str(ele) for ele in x.index]
x

Unnamed: 0,col_0,col_1,col_2,col_3,col_4
row_0,2023-10-14,-32,83,q,T
row_1,2023-10-15,-14,30,z,P
row_2,2023-10-16,-82,60,o,C
row_3,2023-10-17,-37,11,h,A
row_4,2023-10-18,-4,26,l,W
row_5,2023-10-19,-64,2,l,W
row_6,2023-10-20,-95,87,e,U
row_7,2023-10-21,-65,62,r,B
row_8,2023-10-22,-34,49,n,Q
row_9,2023-10-23,-64,23,i,W


In [5]:
x.rename(columns={'col_0': 'date'},
         index={'row_9': 'ninthrow'}, inplace=True)
x

Unnamed: 0,date,col_1,col_2,col_3,col_4
row_0,2023-10-14,-32,83,q,T
row_1,2023-10-15,-14,30,z,P
row_2,2023-10-16,-82,60,o,C
row_3,2023-10-17,-37,11,h,A
row_4,2023-10-18,-4,26,l,W
row_5,2023-10-19,-64,2,l,W
row_6,2023-10-20,-95,87,e,U
row_7,2023-10-21,-65,62,r,B
row_8,2023-10-22,-34,49,n,Q
ninthrow,2023-10-23,-64,23,i,W


drop=True: In x.reset_index(), this parameter is set to True to discard the current index and replace it with the default integer index. If set to False (the default), the current index would be added as a new column in the DataFrame.

inplace=True: This parameter is set to True to modify the DataFrame in place, meaning the changes will be applied to the existing DataFrame x without the need to create a new one. If set to False (or omitted), a new DataFrame with the reset index would be returned.

In [6]:
x.reset_index(drop=True , inplace=True)
x

Unnamed: 0,date,col_1,col_2,col_3,col_4
0,2023-10-14,-32,83,q,T
1,2023-10-15,-14,30,z,P
2,2023-10-16,-82,60,o,C
3,2023-10-17,-37,11,h,A
4,2023-10-18,-4,26,l,W
5,2023-10-19,-64,2,l,W
6,2023-10-20,-95,87,e,U
7,2023-10-21,-65,62,r,B
8,2023-10-22,-34,49,n,Q
9,2023-10-23,-64,23,i,W
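
For comparison, the sketch below shows what happens without drop=True and without inplace=True, on a small made-up frame. The old labels are kept as a column named "index", and the original frame is left unchanged.

```python
import pandas as pd

df = pd.DataFrame({"v": [1, 2]}, index=["row_0", "row_1"])

kept = df.reset_index()              # old labels become a column named "index"
dropped = df.reset_index(drop=True)  # old labels are discarded

print(kept.columns.tolist())
print(dropped.columns.tolist())
print(df.index.tolist())             # df itself is untouched in both cases
```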




The code x.insert(loc=5, column='col_5', value=np.random.randint(10, 50, 10)) inserts a new column named 'col_5' into the DataFrame x at position 5. Because x already has five columns (positions 0 to 4), position 5 places the new column at the end. The array produced by np.random.randint(10, 50, 10) holds 10 random integers from 10 to 49, which matches the 10 rows of the DataFrame; a different length would raise a ValueError.

In [7]:
x.insert(loc=5, column='col_5', value=np.random.randint(10, 50, 10))
x

Unnamed: 0,date,col_1,col_2,col_3,col_4,col_5
0,2023-10-14,-32,83,q,T,21
1,2023-10-15,-14,30,z,P,36
2,2023-10-16,-82,60,o,C,12
3,2023-10-17,-37,11,h,A,21
4,2023-10-18,-4,26,l,W,17
5,2023-10-19,-64,2,l,W,38
6,2023-10-20,-95,87,e,U,29
7,2023-10-21,-65,62,r,B,16
8,2023-10-22,-34,49,n,Q,10
9,2023-10-23,-64,23,i,W,17
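
The length requirement of insert() can be checked with a short sketch. The frame and column names below are made up; the second call deliberately passes too few values.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Matching length: the new column is placed at position 0.
df.insert(loc=0, column="first", value=[7, 8, 9])

# Mismatched length: pandas raises ValueError and the column is not added.
error = None
try:
    df.insert(loc=1, column="bad", value=[1, 2])
except ValueError as exc:
    error = exc

print(list(df.columns), type(error).__name__)
```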


x[['col_4']]: This part of the code selects the 'col_4' column as a DataFrame, not as a Series. The double square brackets ([['col_4']]) ensure that the result is a DataFrame with one column, rather than a Series.

[2:5]: This part of the code slices the DataFrame obtained in the previous step by position, selecting the rows at positions 2 to 4 (position 5 is excluded).

In [8]:
x[['col_4']][2:5]

Unnamed: 0,col_4
2,C
3,A
4,W
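
The same selection can also be written as a single .iloc or .loc call, which avoids chaining two indexing steps. This is a sketch on a small made-up frame, not the x used above.

```python
import pandas as pd

df = pd.DataFrame({"col_3": list("qzohl"), "col_4": list("TPCAW")})

chained = df[["col_4"]][2:5]       # column selection, then positional row slice
with_iloc = df.iloc[2:5, [1]]      # positions only: rows 2..4, column at position 1
with_loc = df.loc[2:4, ["col_4"]]  # labels: the end label 4 is included

print(chained.equals(with_iloc), chained.equals(with_loc))
```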


In [9]:
np.random.seed(23)
y = pd.DataFrame({'col1':np.random.randint(-1000,1000,1000),
                  'col2':np.random.randint(-100,100,1000),
                  'col3':np.random.normal(0,1,1000)})
y

Unnamed: 0,col1,col2,col3
0,-405,70,-0.308605
1,-258,65,-0.861110
2,64,4,0.493138
3,993,-52,-0.027219
4,-50,51,-0.730339
...,...,...,...
995,-699,46,-0.654231
996,-133,-65,-1.221007
997,853,-45,0.206911
998,199,38,-1.494436
999,-364,-33,-0.964724


col1: Contains 1000 random integers between -1000 (inclusive) and 1000 (exclusive).
col2: Contains 1000 random integers between -100 (inclusive) and 100 (exclusive).
col3: Contains 1000 random numbers drawn from a normal distribution with mean 0 and standard deviation 1.

In [13]:
y.head()  # Display the first rows of y; by default, the first 5.

Unnamed: 0,col1,col2,col3
0,-405,70,-0.308605
1,-258,65,-0.86111
2,64,4,0.493138
3,993,-52,-0.027219
4,-50,51,-0.730339


In [14]:
y.tail()  # Display the last rows of y; by default, the last 5.

Unnamed: 0,col1,col2,col3
995,-699,46,-0.654231
996,-133,-65,-1.221007
997,853,-45,0.206911
998,199,38,-1.494436
999,-364,-33,-0.964724


In [16]:
y.sample(4)

Unnamed: 0,col1,col2,col3
208,-851,84,-1.68677
438,700,46,-0.705908
188,922,77,-0.434953
128,-777,-84,-0.300938


The y.sample(4) method is used to randomly sample 4 rows from the DataFrame y. It returns a new DataFrame containing a random subset of rows from the original DataFrame. 
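
Because the rows are drawn at random, repeated calls generally return different rows. If a repeatable sample is needed, the random_state parameter can be fixed, as in this sketch on a made-up frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"v": np.arange(100)})

first = df.sample(4, random_state=0)
second = df.sample(4, random_state=0)  # same seed, same rows

print(first.equals(second))
```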

In [18]:
y.describe()

Unnamed: 0,col1,col2,col3
count,1000.0,1000.0,1000.0
mean,-17.5,-0.238,-0.019928
std,587.412197,57.664153,1.015113
min,-996.0,-100.0,-3.129831
25%,-533.0,-51.0,-0.715867
50%,-21.0,2.0,-0.012334
75%,500.25,48.0,0.675573
max,998.0,99.0,3.2774


In [19]:
x['col_5'] = 10
x.describe(include = 'all')

Unnamed: 0,date,col_1,col_2,col_3,col_4,col_5
count,10,10.0,10.0,10,10,10.0
unique,,9.0,10.0,9,8,
top,,-64.0,83.0,l,W,
freq,,2.0,1.0,2,3,
mean,2023-10-18 12:00:00,,,,,10.0
min,2023-10-14 00:00:00,,,,,10.0
25%,2023-10-16 06:00:00,,,,,10.0
50%,2023-10-18 12:00:00,,,,,10.0
75%,2023-10-20 18:00:00,,,,,10.0
max,2023-10-23 00:00:00,,,,,10.0


In [20]:
x.describe(include = ['object', 'int64']).T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
col_1,10.0,9.0,-64.0,2.0,,,,,,,
col_2,10.0,10.0,83.0,1.0,,,,,,,
col_3,10.0,9.0,l,2.0,,,,,,,
col_4,10.0,8.0,W,3.0,,,,,,,
col_5,10.0,,,,10.0,0.0,10.0,10.0,10.0,10.0,10.0


In [21]:
x.describe(exclude = 'float64')

Unnamed: 0,date,col_1,col_2,col_3,col_4,col_5
count,10,10.0,10.0,10,10,10.0
unique,,9.0,10.0,9,8,
top,,-64.0,83.0,l,W,
freq,,2.0,1.0,2,3,
mean,2023-10-18 12:00:00,,,,,10.0
min,2023-10-14 00:00:00,,,,,10.0
25%,2023-10-16 06:00:00,,,,,10.0
50%,2023-10-18 12:00:00,,,,,10.0
75%,2023-10-20 18:00:00,,,,,10.0
max,2023-10-23 00:00:00,,,,,10.0


In [22]:
# Rebuild x from scratch; each inner sequence is again one ROW.
x = pd.DataFrame([pd.date_range('2023-10-14', '2023-10-23'),
                  np.random.randint(-100, -1, 10),
                  np.random.randint(1, 100, 10),
                  [chr(ele) for ele in np.random.randint(97, 123, 10)],
                  [chr(ele) for ele in np.random.randint(65, 91, 10)]])

In [23]:
x = x.T


In [24]:
x[1].dtype

dtype('O')

In [25]:
x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   0       10 non-null     datetime64[ns]
 1   1       10 non-null     object        
 2   2       10 non-null     object        
 3   3       10 non-null     object        
 4   4       10 non-null     object        
dtypes: datetime64[ns](1), object(4)
memory usage: 532.0+ bytes


In [26]:
x[1] = x[1].astype('int64')  # convert column 1 from object to int64
x[1]

0   -81
1   -73
2   -97
3   -74
4    -2
5   -53
6   -24
7   -85
8   -65
9   -82
Name: 1, dtype: int64
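
Converting one column at a time with astype() works, but when several columns ended up as object after a transpose, infer_objects() or pd.to_numeric() may be more convenient. The sketch below is an assumption-laden illustration on a made-up frame: a double transpose is used only to force the object dtype.

```python
import pandas as pd

# Transposing a frame with mixed column types yields object columns.
mixed = pd.DataFrame({"n": [1, 2, 3], "s": ["a", "b", "c"]}).T.T
print(mixed["n"].dtype)

# infer_objects() re-detects a better dtype for every object column at once.
fixed = mixed.infer_objects()
print(fixed["n"].dtype)

# pd.to_numeric() converts a single column and fails loudly on non-numeric values.
numbers = pd.to_numeric(mixed["n"])
print(numbers.tolist())
```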