# PANDAS

#### INTRODUCTION

Pandas is an open-source Python library that provides easy-to-use data structures for data analysis. It is well suited to data manipulation, data analysis, and data visualization.

#### WHY PANDAS?

1. Easy reading and writing of CSV files and databases.
2. Easy handling of missing data (represented as NaN) in floating-point as well as non-floating-point data.
3. Data can be manipulated by column: columns can be inserted into and deleted from DataFrames and higher-dimensional objects.
4. Intuitive merging and joining of data sets.
5. Intelligent label-based slicing, fancy indexing, and subsetting of large data sets.
6. Robust IO tools for loading data from flat files (CSV and delimited), Excel files, and databases, and for saving/loading data.
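
As a rough illustration of the CSV and missing-data points in the list, here is a minimal sketch. The CSV text and the names `csv_text`, `df_scores`, and `round_trip` are made up for this example; an in-memory buffer stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical CSV content; the empty field in the second row becomes NaN.
csv_text = "name,score\nalice,90\nbob,\ncarol,75\n"

df_scores = pd.read_csv(io.StringIO(csv_text))

# isna() flags missing entries; here exactly one score is missing.
missing_count = int(df_scores["score"].isna().sum())

# Write the frame back out as CSV text (index=False omits the row labels)
# and read it again to check that nothing was lost.
buffer = io.StringIO()
df_scores.to_csv(buffer, index=False)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))

print(df_scores)
print("missing scores:", missing_count)
```

With a real file, a path string would take the place of the `io.StringIO` objects.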



In [1]:
import pandas as pd
import numpy as np

## Series
A series is a 1-D data structure. It is basically a labelled array that can hold different data types:
* int
* float
* String
* Python object
* many more

The data is laid out along a single axis, with each value aligned to an index label.
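
To make the data-type list concrete, here is a small sketch with made-up values (the variable names are ours). It suggests that a Series of mixed Python objects falls back to the generic `object` dtype.

```python
import pandas as pd

int_series = pd.Series([1, 2, 3])            # integers
float_series = pd.Series([1.5, 2.5, 3.5])    # floats
mixed_series = pd.Series([1, "two", 3.0])    # mixed values are stored as Python objects

print(int_series.dtype)
print(float_series.dtype)
print(mixed_series.dtype)
```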

### Creating a series from random data

In [12]:
# Generate random data for the series
data_for_series = np.random.randint(0, 100, size=10)  # 10 random integers from 0 to 99 (100 is excluded)
print(data_for_series)

[84 83  8 57 26 55 49 95 64 32]


In [13]:
s = pd.Series(data_for_series)

In [14]:
print(s)

0    84
1    83
2     8
3    57
4    26
5    55
6    49
7    95
8    64
9    32
dtype: int32


We can get summary statistics for the series with **describe()**:

In [15]:
s.describe()

count    10.000000
mean     55.300000
std      27.737059
min       8.000000
25%      36.250000
50%      56.000000
75%      78.250000
max      95.000000
dtype: float64

### Accessing the data

Using **head()**, we can view the data from the top. If no argument is passed, **head()** returns the first 5 elements.

In [0]:
s.head()

0    84
1    83
2     8
3    57
4    26
dtype: int32

In [0]:
s.head(3)

0    84
1    83
2     8
dtype: int32

Similarly, the **tail()** function returns elements from the bottom of the data (the last 5 by default).

In [0]:
s.tail()

5    55
6    49
7    95
8    64
9    32
dtype: int32

In [0]:
s.tail(3)

7    95
8    64
9    32
dtype: int32

#### Using **loc** and **iloc**  
* **loc** selects rows by their ***label*** in the index.
* **iloc** selects rows by their integer ***position***. (Note: iloc accepts integer positions, not labels)
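
With the default index 0, 1, 2, ... labels and positions coincide, so the difference is easy to miss. Here is a minimal sketch with a hypothetical series (`prices`, not part of this notebook) whose integer labels differ from the positions:

```python
import pandas as pd

# Hypothetical series: the labels are 10, 20, 30; the positions are 0, 1, 2.
prices = pd.Series([100, 200, 300], index=[10, 20, 30])

by_label = prices.loc[20]      # the element labelled 20
by_position = prices.iloc[0]   # the element at the first position

print(by_label, by_position)
```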

In [17]:
print(s)
s.loc[2]

0    84
1    83
2     8
3    57
4    26
5    55
6    49
7    95
8    64
9    32
dtype: int32


8

In [18]:
print(s)
s.iloc[2]

0    84
1    83
2     8
3    57
4    26
5    55
6    49
7    95
8    64
9    32
dtype: int32


8

Now let's repeat this with a series that has a custom index.

In [19]:
index_for_series = 'A B C D E F G H I J'.split()

In [20]:
print(index_for_series)

['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']


In [21]:
s_c = pd.Series(data_for_series, index=index_for_series)

In [22]:
print(s_c)

A    84
B    83
C     8
D    57
E    26
F    55
G    49
H    95
I    64
J    32
dtype: int32


In [0]:
s_c.head()

A    84
B    83
C     8
D    57
E    26
dtype: int32

In [0]:
s_c.head(3)

A    84
B    83
C     8
dtype: int32

In [0]:
s_c.tail()

F    55
G    49
H    95
I    64
J    32
dtype: int32

In [0]:
s_c.tail(3)

H    95
I    64
J    32
dtype: int32

In [26]:
s_c.loc['C']

8

In [0]:
s_c.iloc[2]

8

In [24]:
s_c.loc[['C','D']]

C     8
D    57
dtype: int32

In [31]:
print(s_c)
s_c.iloc[2:4] #2 inclusive, 4 exclusive

A    84
B    83
C     8
D    57
E    26
F    55
G    49
H    95
I    64
J    32
dtype: int32


C     8
D    57
dtype: int32

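Both of the following raise an error: **loc** expects a label that exists in the index (1 does not), and **iloc** expects an integer position ('A' is a label):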

```python
s_c.loc[1]
s_c.iloc['A']
```
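
A small sketch of what those errors look like in practice. It uses a shortened, hypothetical series (`letters`) and catches the exception types that current pandas versions raise as far as we know, namely `KeyError` for a missing label and `TypeError` for a non-integer position:

```python
import pandas as pd

letters = pd.Series([84, 83, 8], index=["A", "B", "C"])

try:
    letters.loc[1]            # 1 is not a label in the index
except KeyError as exc:
    loc_error = type(exc).__name__

try:
    letters.iloc["A"]         # iloc does not accept labels
except TypeError as exc:
    iloc_error = type(exc).__name__

print(loc_error, iloc_error)
```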

#### Slicing the series

In [32]:
print(s)
s[0:3] #inclusive:exclusive

0    84
1    83
2     8
3    57
4    26
5    55
6    49
7    95
8    64
9    32
dtype: int32


0    84
1    83
2     8
dtype: int32

In [33]:
s_c[0:3]

A    84
B    83
C     8
dtype: int32
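
One detail worth keeping in mind: positional slices exclude the stop position, while label slices with **loc** include the stop label. A minimal sketch with a shortened, hypothetical series (`s_labels`):

```python
import pandas as pd

s_labels = pd.Series([84, 83, 8, 57], index=list("ABCD"))

positional = s_labels.iloc[0:3]    # positions 0, 1, 2 (stop position excluded)
by_label = s_labels.loc["A":"C"]   # labels A, B, C (stop label included)

print(positional.equals(by_label))
```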

#### Creating series from dictionary

In [42]:
data_dict = {
    'A':1,
    'B':100,
    'C':12,
    'D':14,
    'E':155,
    'F':22,
    'G':123,
    'H':21,
    'I':51,
    'J':74,
}
print(data_dict)
type(data_dict)

{'A': 1, 'B': 100, 'C': 12, 'D': 14, 'E': 155, 'F': 22, 'G': 123, 'H': 21, 'I': 51, 'J': 74}


dict

In [48]:
s_dict = pd.Series(data_dict)

In [49]:
s_dict

A      1
B    100
C     12
D     14
E    155
F     22
G    123
H     21
I     51
J     74
dtype: int64
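
The dictionary keys become the index. As a small sketch (the names `stock` and `subset` are made up here), passing an explicit `index` alongside a dictionary picks keys in that order, and a key that is absent from the dictionary yields NaN:

```python
import pandas as pd

stock = {"A": 1, "B": 100, "C": 12}

# 'Z' is not a key of the dictionary, so its value is missing (NaN).
subset = pd.Series(stock, index=["C", "A", "Z"])

print(subset)
```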

## DATAFRAME

A DataFrame is a two-dimensional data structure: data is aligned in a tabular fashion in rows and columns. A pandas DataFrame can be created using the following constructor:

```python
pandas.DataFrame(data, index, columns, dtype)
```
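
A minimal sketch of how those four parameters can be used. The row labels `a`, `b`, `c` and the variable names are assumptions made for illustration:

```python
import pandas as pd

# A dict maps column names to column values.
people = pd.DataFrame(
    data={"Name": ["Harry", "John", "Andrew"], "Age": [15, 14, 13]},
    index=["a", "b", "c"],        # row labels (hypothetical)
    columns=["Name", "Age"],      # which columns to keep, and in what order
)

# dtype forces a type for the data; here the integers are stored as floats.
ages = pd.DataFrame([[15], [14], [13]], columns=["Age"], dtype="float64")

print(people)
print(ages.dtypes)
```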

#### Creating a DataFrame from a list of lists

In [54]:
data = [['Harry', 15], ['John', 14], ['Andrew', 13]] 
df = pd.DataFrame(data, columns=['Name','Age']) 
df

[['Harry', 15], ['John', 14], ['Andrew', 13]]


     Name  Age
0   Harry   15
1    John   14
2  Andrew   13


### Creating random data

In [58]:
data = np.random.randint(0,10,(5,4)) # 5x4 matrix of random integers from 0 to 9 (10 is excluded)

In [59]:
print(data)

[[5 5 2 3]
 [6 2 2 5]
 [6 3 3 1]
 [6 1 4 3]
 [1 1 3 9]]


In [62]:
#creating dataframe from random numbers
my_index = '1 2 3 4 5'.split()
print(my_index)
#df = pd.DataFrame(data,columns='A B C D'.split())
df = pd.DataFrame(data,index=my_index,columns='A B C D'.split())
df

['1', '2', '3', '4', '5']


   A  B  C  D
1  5  5  2  3
2  6  2  2  5
3  6  3  3  1
4  6  1  4  3
5  1  1  3  9


### Checking the information

In [63]:
print(df.info())
df

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 1 to 5
Data columns (total 4 columns):
A    5 non-null int32
B    5 non-null int32
C    5 non-null int32
D    5 non-null int32
dtypes: int32(4)
memory usage: 120.0+ bytes
None


   A  B  C  D
1  5  5  2  3
2  6  2  2  5
3  6  3  3  1
4  6  1  4  3
5  1  1  3  9


In [64]:
print(df)
df.describe()


   A  B  C  D
1  5  5  2  3
2  6  2  2  5
3  6  3  3  1
4  6  1  4  3
5  1  1  3  9


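              A        B        C        D
count  5.000000  5.00000  5.00000  5.00000
mean   4.800000  2.40000  2.80000  4.20000
std    2.167948  1.67332  0.83666  3.03315
min    1.000000  1.00000  2.00000  1.00000
25%    5.000000  1.00000  2.00000  3.00000
50%    6.000000  2.00000  3.00000  3.00000
75%    6.000000  3.00000  3.00000  5.00000
max    6.000000  5.00000  4.00000  9.00000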


### Indexing columns

In [75]:
type(df['B'])  # a single column is a Series
df['B']        # df.B can also be used
# type(df[['B','D']])
df[['B','D']]  # a list of column names returns a DataFrame

   B  D
1  5  3
2  2  5
3  3  1
4  1  3
5  1  9


In [0]:
type(df)

pandas.core.frame.DataFrame

#### Using iloc and loc
Here we show equivalent selections using **iloc** (by position) and **loc** (by label).

In [76]:
print(df)
df.iloc[0]

   A  B  C  D
1  5  5  2  3
2  6  2  2  5
3  6  3  3  1
4  6  1  4  3
5  1  1  3  9


A    5
B    5
C    2
D    3
Name: 1, dtype: int32

In [77]:
df.loc['1']

A    5
B    5
C    2
D    3
Name: 1, dtype: int32

In [92]:
#type(df.iloc[:, [0]])
df.iloc[:, [0]]

   A
1  5
2  6
3  6
4  6
5  1


In [87]:
df.loc[:, ['A']]

   A
1  5
2  6
3  6
4  6
5  1


In [93]:
df.iloc[:, [1, 3]]

   B  D
1  5  3
2  2  5
3  3  1
4  1  3
5  1  9


In [99]:
df.loc[:, ['B','D']]

   B  D
1  5  3
2  2  5
3  3  1
4  1  3
5  1  9


In [103]:
#type(df.iloc[1, [1, 3]])
df.iloc[1, [1, 3]]

B    2
D    5
Name: 2, dtype: int32

In [104]:
df.loc['2', ['B', 'D']]

B    2
D    5
Name: 2, dtype: int32

In [105]:
df.iloc[1:3, [1,3]]

   B  D
2  2  5
3  3  1


In [106]:
df.loc[['2', '3'], ['B','D']]

   B  D
2  2  5
3  3  1


In [107]:
df.iloc[[1, 3, 4],[1, 3]]

   B  D
2  2  5
4  1  3
5  1  9


In [113]:
df.loc[['2','4','5'], ['B','D']]

   B  D
2  2  5
4  1  3
5  1  9
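
The pairs of cells above can also be compared programmatically. The sketch below rebuilds the same 5x4 table under a new name (`table`, so that it does not touch `df`) and uses `DataFrame.equals` to check that the positional and label-based selections match:

```python
import pandas as pd

table = pd.DataFrame(
    [[5, 5, 2, 3], [6, 2, 2, 5], [6, 3, 3, 1], [6, 1, 4, 3], [1, 1, 3, 9]],
    index=["1", "2", "3", "4", "5"],
    columns=["A", "B", "C", "D"],
)

# Lists of positions versus lists of labels.
by_position = table.iloc[[1, 3, 4], [1, 3]]
by_label = table.loc[["2", "4", "5"], ["B", "D"]]
same_selection = by_position.equals(by_label)

# Slices: iloc excludes the stop position, loc includes the stop label.
pos_slice = table.iloc[1:3]
label_slice = table.loc["2":"3"]
same_slice = pos_slice.equals(label_slice)

print(same_selection, same_slice)
```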


#### Combine two columns

In [114]:
df['Sum'] = df['A'] + df['B']
df

   A  B  C  D  Sum
1  5  5  2  3   10
2  6  2  2  5    8
3  6  3  3  1    9
4  6  1  4  3    7
5  1  1  3  9    2


In [116]:
df_copy = df  # note: plain assignment makes df_copy another name for the same object, not an independent copy
df_copy

   A  B  C  D  Sum
1  5  5  2  3   10
2  6  2  2  5    8
3  6  3  3  1    9
4  6  1  4  3    7
5  1  1  3  9    2
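
Plain assignment such as `df_copy = df` does not duplicate the data; both names refer to the same DataFrame. A minimal sketch of the difference from `DataFrame.copy()`, using a small hypothetical frame (`original`) so that `df` is left alone:

```python
import pandas as pd

original = pd.DataFrame({"A": [5, 6], "B": [5, 2]})

alias = original             # second name for the same object
snapshot = original.copy()   # independent copy of the data

# A change made through one name is visible through the alias,
# but not in the copy taken earlier.
original["Sum"] = original["A"] + original["B"]

print("Sum" in alias.columns, "Sum" in snapshot.columns)
```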


#### Dropping a column

In [118]:
df.drop('Sum',axis=1) #Column drop axis=1; the result is returned as a new DataFrame
df

   A  B  C  D  Sum
1  5  5  2  3   10
2  6  2  2  5    8
3  6  3  3  1    9
4  6  1  4  3    7
5  1  1  3  9    2


We can see that the column is not dropped from the dataframe: **drop()** returns a new DataFrame and leaves the original untouched. To apply the change to the dataframe itself, we set the ***inplace*** flag:
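
A small sketch of this behaviour on a hypothetical frame (`frame`, not the notebook's `df`). It also shows that assigning the result back to the variable is another way to keep the change:

```python
import pandas as pd

frame = pd.DataFrame({"A": [5, 6], "B": [5, 2], "Sum": [10, 8]})

without_sum = frame.drop("Sum", axis=1)    # new DataFrame without 'Sum'
still_has_sum = "Sum" in frame.columns     # the original is unchanged

# Rebinding the name keeps the result without using inplace.
frame = frame.drop("Sum", axis=1)

print(still_has_sum, list(frame.columns))
```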

In [120]:
df.drop('Sum', axis=1, inplace=True)

In [121]:
df

   A  B  C  D
1  5  5  2  3
2  6  2  2  5
3  6  3  3  1
4  6  1  4  3
5  1  1  3  9


In [0]:
df

In [125]:
df.drop('1', axis=0)

   A  B  C  D
2  6  2  2  5
3  6  3  3  1
4  6  1  4  3
5  1  1  3  9


In [126]:
df

   A  B  C  D
1  5  5  2  3
2  6  2  2  5
3  6  3  3  1
4  6  1  4  3
5  1  1  3  9


In [132]:
df.drop('1', axis=0, inplace=True)

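ValueError: labels ['1'] not contained in axis

The error above comes from running this cell a second time: the first run had already removed row '1', so the label was no longer in the index.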

In [129]:
df

   A  B  C  D
2  6  2  2  5
3  6  3  3  1
4  6  1  4  3
5  1  1  3  9


Alternatively, we can assign the result of **drop()** back to the variable instead of using **inplace**:

In [130]:
# Rebuild df from df_copy (the 'Sum' column is re-created in the next cell)

df = pd.DataFrame(df_copy)

In [140]:
df
df['Sum']=df['A']+df['B']

In [141]:
df_copy = pd.DataFrame(df)

In [142]:
df = df.drop('Sum',axis=1) #Column drop axis=1

In [143]:
df_copy

   A  B  C  D  Sum
2  6  2  2  5    8
3  6  3  3  1    9
4  6  1  4  3    7
5  1  1  3  9    2


In [144]:
df

   A  B  C  D
2  6  2  2  5
3  6  3  3  1
4  6  1  4  3
5  1  1  3  9


In [146]:
df = df.drop('2',axis=0) #row drop axis=0
df

   A  B  C  D
3  6  3  3  1
4  6  1  4  3
5  1  1  3  9


In [149]:
df.loc[:,'A']

3    6
4    6
5    1
Name: A, dtype: int32

In [0]:
# selecting a particular cell by label (row '2' was dropped above, so we use row '3')
df.loc['3','D']

In [0]:
# selecting a particular cell by integer position (row, column)

df.iloc[2,3]

In [0]:
df.loc[['4','3'],['A','B']]

In [0]:
df

In [0]:
df>0

In [0]:
df[df>5] = 'Changed'

In [0]:
df
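
A comparison such as `df>5` produces a DataFrame of booleans, which can then be used as a mask. Here is a minimal sketch on a hypothetical two-column frame (`grid`). `DataFrame.mask` is shown as one way to replace the flagged values while keeping the columns numeric; it is an alternative, not what the cells above do.

```python
import pandas as pd

grid = pd.DataFrame({"A": [6, 6, 1], "D": [1, 3, 9]}, index=["3", "4", "5"])

mask = grid > 5                # True where the value is greater than 5
selected = grid[mask]          # values that fail the test become NaN
capped = grid.mask(mask, 5)    # replace the flagged values with 5

print(mask)
print(selected)
print(capped)
```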

### Filling NaN with some value

In [0]:
import numpy as np

In [0]:
df[df=='Changed'] = np.nan  # np.nan is the portable spelling; the np.NaN alias was removed in NumPy 2.0

In [0]:
df

In [0]:
df['D'].mean()

In [0]:
df.fillna(df['D'].mean())

In [0]:
df

**fillna()** also returns a new DataFrame. To apply the changes to **df** itself, we need to use **inplace**:

In [0]:
df.fillna(df['D'].mean(), inplace=True)

In [0]:
df
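
The cells above fill every gap with one scalar, the mean of column D. As a hedged sketch of an alternative, passing the Series returned by `mean()` fills each column with its own mean. The frame `gaps` is a made-up example.

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({"A": [np.nan, np.nan, 1.0], "D": [1.0, 3.0, np.nan]})

# One scalar for every gap (the mean of column D).
filled_scalar = gaps.fillna(gaps["D"].mean())

# A Series of per-column means: each column is filled with its own mean.
filled_by_column = gaps.fillna(gaps.mean())

print(filled_scalar)
print(filled_by_column)
```

Neither call modifies `gaps`; both return new DataFrames.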