# Pandas

Pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
It is designed for working with structured data, such as tabular data loaded from CSV or Excel files, SQL databases, and other sources.

The two key data structures in Pandas are the Series and the DataFrame.
- A Series is a one-dimensional, labeled, array-like object that can hold any data type, such as integers, strings, or floats.
- A DataFrame is a two-dimensional table of data with labeled rows and columns, similar to a spreadsheet. Each column in a DataFrame can be of a different data type.
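
As a minimal sketch of the two structures described above (the names `ages` and `people` and all values are made up for illustration):

```python
import pandas as pd

# A Series: one-dimensional, with an index labelling each value
ages = pd.Series([31, 25, 47], index=["ann", "bob", "cy"], name="age")

# A DataFrame: two-dimensional; each column can have its own dtype
people = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cy"],     # strings
        "age": [31, 25, 47],              # integers
        "height_m": [1.68, 1.80, 1.75],   # floats
    }
)

print(ages["bob"])           # look up a value by its label
print(people.dtypes)         # one dtype per column
print(type(people["age"]))   # a single column of a DataFrame is a Series
```

Selecting one column of a DataFrame gives back a Series, which is why the two structures are usually introduced together.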

In [1]:
import pandas as pd
import numpy as np

In [2]:
# 5x4 frame holding 0..19, with explicit row and column labels
df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
                  columns=['Column1', 'Column2', 'Column3', 'Column4'])

In [3]:
df

,Column1,Column2,Column3,Column4
Row1,0,1,2,3
Row2,4,5,6,7
Row3,8,9,10,11
Row4,12,13,14,15
Row5,16,17,18,19


Accessing elements: `.loc` selects by label and `.iloc` selects by integer position

In [4]:
df.loc['Row1']

Column1    0
Column2    1
Column3    2
Column4    3
Name: Row1, dtype: int64

Check the type of a single selected row; it is a Series

In [5]:
type(df.loc['Row1'])

pandas.core.series.Series

In [6]:
df.iloc[:3,2:]

,Column3,Column4
Row1,2,3
Row2,6,7
Row3,10,11
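
The two selection styles used above differ in how they treat slice endpoints. The sketch below rebuilds the same 5x4 frame and compares a label slice with a position slice; the variable names `by_label` and `by_position` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(20).reshape(5, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

# Label-based: both endpoints of a label slice are included
by_label = df.loc["Row1":"Row3", "Column3":"Column4"]

# Position-based: the stop position is excluded, as in ordinary Python slicing
by_position = df.iloc[:3, 2:]

print(by_label.equals(by_position))

# A single cell, addressed by label and by position
print(df.loc["Row2", "Column3"], df.iloc[1, 2])
```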


In [7]:
df['Column1'].value_counts()

0     1
4     1
8     1
12    1
16    1
Name: Column1, dtype: int64

Convert a DataFrame into a NumPy array

In [8]:
# Same result as df.iloc[:, :].values; to_numpy() is the recommended spelling
df.to_numpy()

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])
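
One detail worth knowing when converting: a NumPy array has a single dtype, so the result depends on the mix of column types. A small sketch with two made-up frames (`numeric` and `mixed` are hypothetical names):

```python
import pandas as pd

numeric = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
mixed = pd.DataFrame({"a": [1, 2], "label": ["x", "y"]})

# Integer and float columns are converted to a common float dtype
print(numeric.to_numpy().dtype)

# Numbers mixed with strings fall back to the generic object dtype
print(mixed.to_numpy().dtype)
```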

In [9]:
from sklearn import datasets

In [10]:
iris_data=datasets.load_iris()

In [11]:
# Combine the features and target arrays into a single array
iris_array = np.column_stack((iris_data.data, iris_data.target))

In [12]:
# Create a list of column names for the DataFrame
column_names = iris_data.feature_names + ["target"]

In [13]:
# Convert the iris array to a Pandas DataFrame
iris_df = pd.DataFrame(data=iris_array, columns=column_names)

In [14]:
iris_df.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0.0
1,4.9,3.0,1.4,0.2,0.0
2,4.7,3.2,1.3,0.2,0.0
3,4.6,3.1,1.5,0.2,0.0
4,5.0,3.6,1.4,0.2,0.0
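
The `target` column above shows `0.0` rather than `0` because stacking float features and integer labels into one array converts everything to float. The sketch below reproduces this on tiny stand-in arrays (`features`, `target`, and `toy_df` are hypothetical, not the real iris data) and casts the labels back to integers.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for iris_data.data and iris_data.target
features = np.array([[5.1, 3.5], [4.9, 3.0], [6.3, 2.5]])
target = np.array([0, 0, 2])

# Stacking floats and ints into one array upcasts everything to float
stacked = np.column_stack((features, target))
toy_df = pd.DataFrame(stacked, columns=["length", "width", "target"])
print(toy_df["target"].dtype)

# Cast the label column back to integers if integer labels are wanted
toy_df["target"] = toy_df["target"].astype(int)
print(toy_df["target"].tolist())
```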


`df.info()` is a DataFrame method that prints a concise summary of the DataFrame: the index range, each column's name, non-null count, and data type, plus the memory usage. It is a quick way to inspect the contents of a DataFrame.

In [15]:
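# Note: without parentheses this shows the bound method's repr (below); call iris_df.info() to print the summary
iris_df.info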

<bound method DataFrame.info of      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                  5.1               3.5                1.4               0.2   
1                  4.9               3.0                1.4               0.2   
2                  4.7               3.2                1.3               0.2   
3                  4.6               3.1                1.5               0.2   
4                  5.0               3.6                1.4               0.2   
..                 ...               ...                ...               ...   
145                6.7               3.0                5.2               2.3   
146                6.3               2.5                5.0               1.9   
147                6.5               3.0                5.2               2.0   
148                6.2               3.4                5.4               2.3   
149                5.9               3.0                5.1               1.8
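
To see the summary that `info()` produces when it is actually called, here is a minimal sketch on a small made-up frame (`toy` is a hypothetical name). It also captures the text through the `buf` parameter, since `info()` prints its report and returns `None`.

```python
import io
import pandas as pd

toy = pd.DataFrame({"x": [1.0, None, 3.0], "y": ["a", "b", "c"]})

# info() writes its report to stdout by default; buf= redirects it
buffer = io.StringIO()
result = toy.info(buf=buffer)
summary = buffer.getvalue()

print(summary)
print(result is None)   # the method itself returns nothing
```

The report lists one line per column with its non-null count, which makes missing values easy to spot.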

`df.describe()` is a DataFrame method that returns a statistical summary of the numerical columns in the DataFrame: the count, mean, standard deviation, minimum, and maximum, plus the 25th, 50th, and 75th percentiles by default.

In [16]:
iris_df.describe()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
count,150.0,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333,1.0
std,0.828066,0.435866,1.765298,0.762238,0.819232
min,4.3,2.0,1.0,0.1,0.0
25%,5.1,2.8,1.6,0.3,0.0
50%,5.8,3.0,4.35,1.3,1.0
75%,6.4,3.3,5.1,1.8,2.0
max,7.9,4.4,6.9,2.5,2.0
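
The statistics in this table are easier to verify by hand on a tiny frame. A sketch with made-up data (`toy`, `stats`, and `extra` are hypothetical names), which also shows the `percentiles` parameter for requesting other cut points:

```python
import pandas as pd

toy = pd.DataFrame({"value": [1, 2, 3, 4, 5], "label": list("abcde")})

# By default only the numeric column is summarised
stats = toy.describe()
print(stats)

# Other percentiles can be requested explicitly
extra = toy["value"].describe(percentiles=[0.1, 0.9])
print(extra)
```

Because `describe()` returns a DataFrame, individual statistics can be read back with `.loc`, for example `stats.loc["mean", "value"]`.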


In [17]:
# Count how many rows fall into each target class
iris_df['target'].value_counts()

0.0    50
1.0    50
2.0    50
Name: target, dtype: int64

In [18]:
# Keep only the rows whose sepal length exceeds 7 cm
iris_df[iris_df['sepal length (cm)'] > 7]

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
102,7.1,3.0,5.9,2.1,2.0
105,7.6,3.0,6.6,2.1,2.0
107,7.3,2.9,6.3,1.8,2.0
109,7.2,3.6,6.1,2.5,2.0
117,7.7,3.8,6.7,2.2,2.0
118,7.7,2.6,6.9,2.3,2.0
122,7.7,2.8,6.7,2.0,2.0
125,7.2,3.2,6.0,1.8,2.0
129,7.2,3.0,5.8,1.6,2.0
130,7.4,2.8,6.1,1.9,2.0
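
The filter above works by building a boolean Series and using it to select rows. The following sketch, on a small made-up frame (`toy` and its column names are hypothetical), shows the intermediate mask and how to combine several conditions.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "length": [4.9, 7.1, 7.6, 6.3],
        "width": [3.0, 3.0, 3.8, 2.5],
        "target": [0, 2, 2, 2],
    }
)

# The comparison yields a boolean Series with one entry per row
mask = toy["length"] > 7
long_rows = toy[mask]

# Combine conditions with & (and) or | (or); each condition needs parentheses
long_and_narrow = toy[(toy["length"] > 7) & (toy["width"] < 3.5)]

# query() expresses the same filter as a string
same_rows = toy.query("length > 7 and width < 3.5")

print(long_rows)
print(long_and_narrow)
```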


In [19]:
# Pairwise Pearson correlation (the default method) between the columns
iris_df.corr()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
sepal length (cm),1.0,-0.11757,0.871754,0.817941,0.782561
sepal width (cm),-0.11757,1.0,-0.42844,-0.366126,-0.426658
petal length (cm),0.871754,-0.42844,1.0,0.962865,0.949035
petal width (cm),0.817941,-0.366126,0.962865,1.0,0.956547
target,0.782561,-0.426658,0.949035,0.956547,1.0
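
To make the correlation matrix easier to read, it helps to try `corr()` on columns whose relationship is known in advance. In this sketch the frame `toy` and its column names are made up: one column is an exact multiple of `x` and another runs in the opposite direction.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "x": [1, 2, 3, 4],
        "double_x": [2, 4, 6, 8],      # rises exactly with x
        "reversed_x": [4, 3, 2, 1],    # falls exactly as x rises
    }
)

corr = toy.corr()
print(corr.round(3))

# Correlation of every other column with x, strongest first
ranked = corr["x"].drop("x").sort_values(ascending=False)
print(ranked)
```

Values near 1 or -1 indicate a strong linear relationship, and the matrix is symmetric with ones on the diagonal, as in the iris output above.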


In [20]:
lst_data=[[1,2,3],[3,4,np.nan],[5,6,np.nan],[np.nan,np.nan,np.nan]]

In [21]:
df=pd.DataFrame(lst_data)

In [22]:
df.head()

,0,1,2
0,1.0,2.0,3.0
1,3.0,4.0,
2,5.0,6.0,
3,,,


In [23]:
## Handling missing values

## Drop every row that contains at least one NaN

df.dropna(axis=0)

,0,1,2
0,1.0,2.0,3.0
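
Dropping every row with a NaN is the strictest option and here leaves a single row. As a sketch of some gentler alternatives on the same data, the example below counts the missing values and then drops or fills them selectively; the result names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [3, 4, np.nan], [5, 6, np.nan], [np.nan, np.nan, np.nan]])

# Count the missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop only the rows in which every value is missing
all_nan_dropped = df.dropna(how="all")

# Keep rows that have at least 2 non-missing values
at_least_two = df.dropna(thresh=2)

# Fill instead of dropping: with a constant, or with each column's mean
filled_zero = df.fillna(0)
filled_mean = df.fillna(df.mean())

print(all_nan_dropped)
print(filled_mean)
```

All of these calls return a new DataFrame and leave `df` unchanged, so the result has to be assigned if it is needed later.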


In [24]:
## Read a tab-separated file from a URL into a DataFrame

df=pd.read_csv('https://download.bls.gov/pub/time.series/cu/cu.item',sep='\t')

In [25]:
df

,item_code,item_name,display_level,selectable,sort_sequence
0,AA0,All items - old base,0,T,2
1,AA0R,Purchasing power of the consumer dollar - old ...,0,T,400
2,SA0,All items,0,T,1
3,SA0E,Energy,1,T,375
4,SA0L1,All items less food,1,T,359
...,...,...,...,...,...
395,SSEA011,College textbooks,3,T,314
396,SSEE041,Smartphones,4,T,335
397,SSFV031A,Food at elementary and secondary schools,3,T,122
398,SSGE013,Infants' equipment,3,T,356
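
The same `read_csv` call can be tried without network access by reading from an in-memory string. In this sketch the text in `raw` is a hypothetical three-column stand-in for a downloaded file, and the `index_col` and `dtype` arguments are optional extras not used in the cell above.

```python
import io
import pandas as pd

# Hypothetical tab-separated text standing in for a downloaded file
raw = "item_code\titem_name\tdisplay_level\nSA0\tAll items\t0\nSA0E\tEnergy\t1\n"

# sep="\t" tells read_csv that fields are separated by tabs, not commas
items = pd.read_csv(io.StringIO(raw), sep="\t")
print(items)

# Optionally pick an index column and pin a column dtype while reading
indexed = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    index_col="item_code",
    dtype={"display_level": "int64"},
)
print(indexed.loc["SA0E", "item_name"])
```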
