What is Pandas?

* It is an open-source library for Python.
* It builds on the power and speed of NumPy to make data analysis and preprocessing easy.
* It provides rich and robust operations on data.

Pandas has two main data structures:
* 1) Series - A one-dimensional array with an index. It stores a single column or row of a DataFrame. (one datatype throughout)
* 2) DataFrame - A tabular, spreadsheet-like structure made of rows, each of which contains one or more columns. (different columns may have dissimilar datatypes)

                                                           --Ahmad Raza

Documentation Link: https://pandas.pydata.org/docs/user_guide/index.html
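
The two structures described above can be sketched as follows. This is a minimal illustration; the names `marks` and `students` are made up for the example and do not come from the notebook.

```python
import pandas as pd

# A Series: one-dimensional, a single dtype, with an index
marks = pd.Series([92, 34, 24], index=["a", "b", "c"], name="marks")

# A DataFrame: two-dimensional, each column may have its own dtype
students = pd.DataFrame({"name": ["roshan", "suman", "sanju"], "marks": [92, 34, 24]})

# Selecting one column of a DataFrame gives back a Series
column = students["marks"]
print(type(marks).__name__, type(students).__name__, type(column).__name__)
print(students.dtypes)
```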

In [1]:
import pandas as pd

In [2]:
import numpy as np

In [3]:
ditcl = { #creating a dictionary in python
    "name":['roshan', 'suman', 'sanju', 'samadh'],
    "marks":[92, 34, 24, 97],
    "city":['rampur', 'kolkata', 'bareilly', 'antartica']
}

In [4]:
df = pd.DataFrame(ditcl) #creating a data frame

In [5]:
df #displaying the data frame

Unnamed: 0,name,marks,city
0,roshan,92,rampur
1,suman,34,kolkata
2,sanju,24,bareilly
3,samadh,97,antartica


In [6]:
df.to_csv('friends.csv') #creating a CSV file from created data

In [7]:
df.to_csv('friends_index_false.csv', index=False) #CSV file without the index column
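
A small sketch of what `index=False` changes. It assumes an in-memory buffer (`io.StringIO`) instead of the files written above, and the frame is a shortened, illustrative version of `df`.

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["roshan", "suman"], "marks": [92, 34]})

# to_csv returns the CSV text when no path is given
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# Reading back: the written index shows up as an extra unnamed column
back_with = pd.read_csv(io.StringIO(with_index))
back_without = pd.read_csv(io.StringIO(without_index))

# index_col=0 tells read_csv to use that first column as the index instead
back_indexed = pd.read_csv(io.StringIO(with_index), index_col=0)

print(list(back_with.columns))
print(list(back_without.columns))
```

This is why a CSV saved with the default settings gains an "Unnamed: 0" column when it is read back without `index_col`.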

In [8]:
data = pd.read_csv('train.csv') #dataframe containing 699 rows of x and y values

In [9]:
data #glimpse of dataset

Unnamed: 0,x,y
0,24,21.549452
1,50,47.464463
2,15,17.218656
3,38,36.586398
4,87,87.288984
...,...,...
694,58,58.595006
695,93,94.625094
696,82,88.603770
697,66,63.648685


In [10]:
data.head() #displays the first 5 rows of the dataset

Unnamed: 0,x,y
0,24,21.549452
1,50,47.464463
2,15,17.218656
3,38,36.586398
4,87,87.288984


In [11]:
data.head(2) #displays first 2 rows of the dataset

Unnamed: 0,x,y
0,24,21.549452
1,50,47.464463


In [12]:
data.tail() #displays the last 5 rows of the dataset

Unnamed: 0,x,y
694,58,58.595006
695,93,94.625094
696,82,88.60377
697,66,63.648685
698,97,94.975266


In [13]:
data.tail(2) #displays last 2 rows of the data frame

Unnamed: 0,x,y
697,66,63.648685
698,97,94.975266


In [14]:
data['x'] #gives all the values of the column 'x'

0      24
1      50
2      15
3      38
4      87
       ..
694    58
695    93
696    82
697    66
698    97
Name: x, Length: 699, dtype: int64

In [15]:
data['x'][3] #gives the value at index label 3 in the x column

38

In [16]:
data['x'][2]=25 #chained assignment: updates the value but triggers the warning shown below; data.loc[2, 'x'] = 25 is the recommended form

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['x'][2]=25 #updates value in the dataset
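
The warning above comes from indexing in two steps (`data['x']` and then `[2]`). A hedged sketch of the single-step alternative follows; it uses a small made-up frame rather than `train.csv`.

```python
import pandas as pd

data = pd.DataFrame({"x": [24, 50, 15, 38], "y": [21.5, 47.5, 17.2, 36.6]})

# One indexing step: row label 2, column 'x'
data.loc[2, "x"] = 25

# .at is a scalar-only alternative for a single cell
data.at[3, "x"] = 40

print(data["x"].tolist())
```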


In [17]:
data.describe() #statistical analysis of data frame

Unnamed: 0,x,y
count,699.0,699.0
mean,50.028612,49.939869
std,28.939702,29.109217
min,0.0,-3.839981
25%,25.0,24.929968
50%,49.0,48.97302
75%,75.0,74.929911
max,100.0,108.871618


In [18]:
df.index = ['Info 1','Info 2','Info 3','Info 4'] #changing the index names

In [19]:
df

Unnamed: 0,name,marks,city
Info 1,roshan,92,rampur
Info 2,suman,34,kolkata
Info 3,sanju,24,bareilly
Info 4,samadh,97,antartica


In [20]:
df.columns = ['Names', 'Marks', 'City'] #changing column names

In [21]:
df

Unnamed: 0,Names,Marks,City
Info 1,roshan,92,rampur
Info 2,suman,34,kolkata
Info 3,sanju,24,bareilly
Info 4,samadh,97,antartica


In [23]:
type(df['Marks']) #Series - 1D Array

pandas.core.series.Series

In [25]:
type(df) #Dataframe - 2D Array

pandas.core.frame.DataFrame

In [31]:
ser = pd.Series(np.random.rand(34)) #Series (1D) of n random values, here, n=34

In [34]:
print(ser, type(ser))

0     0.982858
1     0.198697
2     0.165597
3     0.930724
4     0.745215
5     0.457004
6     0.182825
7     0.614965
8     0.557523
9     0.849208
10    0.053098
11    0.371267
12    0.476499
13    0.825942
14    0.040805
15    0.373599
16    0.595210
17    0.624977
18    0.580755
19    0.566730
20    0.670245
21    0.073037
22    0.761822
23    0.292040
24    0.598301
25    0.914020
26    0.957736
27    0.837849
28    0.517682
29    0.016283
30    0.797512
31    0.925624
32    0.556196
33    0.899862
dtype: float64 <class 'pandas.core.series.Series'>


In [41]:
newdf = pd.DataFrame(np.random.rand(334,5), index=np.arange(334)) #generates a dataframe of size m*n having random values, here, m=334 and n=5

In [38]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,0.746573,0.346514,0.590413,0.214349,0.292463
1,0.886625,0.119876,0.033577,0.028185,0.140623
2,0.948871,0.994769,0.501725,0.351466,0.652369
3,0.338617,0.680656,0.259984,0.402913,0.48359
4,0.412494,0.359296,0.988952,0.680153,0.214434


In [43]:
newdf.describe()

Unnamed: 0,0,1,2,3,4
count,334.0,334.0,334.0,334.0,334.0
mean,0.496666,0.49674,0.516972,0.524138,0.501516
std,0.277048,0.287623,0.283551,0.289986,0.302
min,0.002152,0.003356,0.001744,0.004415,0.00384
25%,0.258194,0.234297,0.272472,0.264535,0.232488
50%,0.505978,0.518553,0.509094,0.525203,0.508891
75%,0.71856,0.729846,0.771347,0.781874,0.771813
max,0.996155,0.998834,0.999024,0.994811,0.997622


In [44]:
type(newdf)

pandas.core.frame.DataFrame

In [46]:
newdf.dtypes #displays the datatype of each column

0    float64
1    float64
2    float64
3    float64
4    float64
dtype: object

In [47]:
newdf[0][0]="harry" #storing a string in a float column converts the whole column to the object dtype

In [51]:
newdf.dtypes #datatype of first column is changed

0     object
1    float64
2    float64
3    float64
4    float64
dtype: object

In [50]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,harry,0.495731,0.482961,0.204074,0.61063
1,0.23607,0.097484,0.649971,0.878693,0.720416
2,0.995941,0.500928,0.591229,0.929397,0.108425
3,0.257371,0.003356,0.334284,0.498563,0.991031
4,0.677884,0.367446,0.198339,0.844476,0.045933


In [52]:
newdf.index #indexes of all rows

Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            324, 325, 326, 327, 328, 329, 330, 331, 332, 333],
           dtype='int64', length=334)

In [54]:
newdf.columns #column labels of the dataframe

RangeIndex(start=0, stop=5, step=1)

In [56]:
newdf.to_numpy #without parentheses this only returns the bound method (see the output below); call newdf.to_numpy() to get the NumPy array

<bound method DataFrame.to_numpy of             0         1         2         3         4
0       harry  0.495731  0.482961  0.204074  0.610630
1     0.23607  0.097484  0.649971  0.878693  0.720416
2    0.995941  0.500928  0.591229  0.929397  0.108425
3    0.257371  0.003356  0.334284  0.498563  0.991031
4    0.677884  0.367446  0.198339  0.844476  0.045933
..        ...       ...       ...       ...       ...
329  0.260389  0.995374  0.564194  0.791043  0.046968
330  0.698648  0.463310  0.760772  0.087167  0.320735
331  0.168753  0.748499  0.592493  0.045817  0.096805
332  0.295851  0.313419  0.669735  0.542396  0.722473
333  0.037356  0.832680  0.426004  0.656932  0.858056

[334 rows x 5 columns]>
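
The output above is the method object itself, because `to_numpy` was referenced without being called. A minimal sketch of the difference, using a small illustrative frame named `frame`:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

method = frame.to_numpy      # just a reference to the method
array = frame.to_numpy()     # the actual NumPy array

print(callable(method), type(array).__name__, array.shape)
```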

In [57]:
newdf[0][0]=0.3210

In [59]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,0.321,0.495731,0.482961,0.204074,0.61063
1,0.23607,0.097484,0.649971,0.878693,0.720416
2,0.995941,0.500928,0.591229,0.929397,0.108425
3,0.257371,0.003356,0.334284,0.498563,0.991031
4,0.677884,0.367446,0.198339,0.844476,0.045933


In [61]:
newdf.T #transpose of the dataset

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,324,325,326,327,328,329,330,331,332,333
0,0.321,0.23607,0.995941,0.257371,0.677884,0.572393,0.022784,0.278741,0.60738,0.649532,...,0.19222,0.925082,0.395566,0.45953,0.38586,0.260389,0.698648,0.168753,0.295851,0.037356
1,0.495731,0.097484,0.500928,0.003356,0.367446,0.418772,0.866225,0.995092,0.030651,0.640049,...,0.700568,0.161433,0.210646,0.827003,0.341391,0.995374,0.46331,0.748499,0.313419,0.83268
2,0.482961,0.649971,0.591229,0.334284,0.198339,0.937255,0.381129,0.081947,0.954591,0.132876,...,0.30023,0.356868,0.686425,0.705375,0.489003,0.564194,0.760772,0.592493,0.669735,0.426004
3,0.204074,0.878693,0.929397,0.498563,0.844476,0.595417,0.464485,0.254339,0.092389,0.968779,...,0.122843,0.004415,0.013466,0.794151,0.600308,0.791043,0.087167,0.045817,0.542396,0.656932
4,0.61063,0.720416,0.108425,0.991031,0.045933,0.091175,0.912511,0.516702,0.823394,0.090038,...,0.874903,0.241134,0.688128,0.202477,0.390588,0.046968,0.320735,0.096805,0.722473,0.858056


In [66]:
newdf.sort_index(axis=0, ascending=False) #sorts by the row labels (axis=0 means rows); ascending is True by default

Unnamed: 0,0,1,2,3,4
333,0.037356,0.832680,0.426004,0.656932,0.858056
332,0.295851,0.313419,0.669735,0.542396,0.722473
331,0.168753,0.748499,0.592493,0.045817,0.096805
330,0.698648,0.463310,0.760772,0.087167,0.320735
329,0.260389,0.995374,0.564194,0.791043,0.046968
...,...,...,...,...,...
4,0.677884,0.367446,0.198339,0.844476,0.045933
3,0.257371,0.003356,0.334284,0.498563,0.991031
2,0.995941,0.500928,0.591229,0.929397,0.108425
1,0.23607,0.097484,0.649971,0.878693,0.720416


In [68]:
newdf.sort_index(axis=1, ascending=False) #sorts by the column labels (axis=1 means columns); ascending is True by default

Unnamed: 0,4,3,2,1,0
0,0.610630,0.204074,0.482961,0.495731,0.321
1,0.720416,0.878693,0.649971,0.097484,0.23607
2,0.108425,0.929397,0.591229,0.500928,0.995941
3,0.991031,0.498563,0.334284,0.003356,0.257371
4,0.045933,0.844476,0.198339,0.367446,0.677884
...,...,...,...,...,...
329,0.046968,0.791043,0.564194,0.995374,0.260389
330,0.320735,0.087167,0.760772,0.463310,0.698648
331,0.096805,0.045817,0.592493,0.748499,0.168753
332,0.722473,0.542396,0.669735,0.313419,0.295851
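
As a compact sketch of the two axes, assuming a tiny frame with deliberately unsorted labels (the name `frame` and its values are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"b": [1, 2, 3], "a": [4, 5, 6]}, index=[2, 0, 1])

by_rows = frame.sort_index(axis=0)                     # row labels ascending
by_cols = frame.sort_index(axis=1)                     # column labels ascending
rows_desc = frame.sort_index(axis=0, ascending=False)  # row labels descending

# sort_index returns a new DataFrame; frame itself is unchanged
print(list(by_rows.index), list(by_cols.columns), list(frame.index))
```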


In [69]:
newdf.head()

Unnamed: 0,0,1,2,3,4
0,0.321,0.495731,0.482961,0.204074,0.61063
1,0.23607,0.097484,0.649971,0.878693,0.720416
2,0.995941,0.500928,0.591229,0.929397,0.108425
3,0.257371,0.003356,0.334284,0.498563,0.991031
4,0.677884,0.367446,0.198339,0.844476,0.045933


In [72]:
newdf[0] #returns the column labelled 0 as a Series

0         0.321
1       0.23607
2      0.995941
3      0.257371
4      0.677884
         ...   
329    0.260389
330    0.698648
331    0.168753
332    0.295851
333    0.037356
Name: 0, Length: 334, dtype: object
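
Note that the dtype above is still `object` even though the string was replaced by 0.3210. A hedged sketch of how the column could be converted back, using a short made-up Series in place of `newdf[0]`:

```python
import pandas as pd

# An object column holding a string next to floats (values are made up)
col = pd.Series(["harry", 0.23607, 0.995941], dtype=object)

# Replace the string with a number: the dtype does not switch back on its own
col.iloc[0] = 0.321
print(col.dtype)

# Convert explicitly once every value is numeric
numeric = col.astype(float)
print(numeric.dtype)
```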

In [75]:
newdf2 = newdf #newdf2 is just another name for the same DataFrame object; no copy is made

In [77]:
newdf2[0][0]=9878

In [79]:
newdf #the change made through newdf2 is visible in newdf

Unnamed: 0,0,1,2,3,4
0,9878,0.495731,0.482961,0.204074,0.610630
1,0.23607,0.097484,0.649971,0.878693,0.720416
2,0.995941,0.500928,0.591229,0.929397,0.108425
3,0.257371,0.003356,0.334284,0.498563,0.991031
4,0.677884,0.367446,0.198339,0.844476,0.045933
...,...,...,...,...,...
329,0.260389,0.995374,0.564194,0.791043,0.046968
330,0.698648,0.463310,0.760772,0.087167,0.320735
331,0.168753,0.748499,0.592493,0.045817,0.096805
332,0.295851,0.313419,0.669735,0.542396,0.722473


In [81]:
newdf3=newdf.copy() #creates an independent copy of the dataframe

In [82]:
newdf3[0][0]=3

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  newdf3[0][0]=3


In [84]:
newdf.head() #the change made to the copy is not reflected in the original dataframe

Unnamed: 0,0,1,2,3,4
0,9878.0,0.495731,0.482961,0.204074,0.61063
1,0.23607,0.097484,0.649971,0.878693,0.720416
2,0.995941,0.500928,0.591229,0.929397,0.108425
3,0.257371,0.003356,0.334284,0.498563,0.991031
4,0.677884,0.367446,0.198339,0.844476,0.045933
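
The difference between plain assignment and `copy()` can be sketched as below. The names `original`, `alias` and `independent` are illustrative, and `.loc` is used for the writes to avoid the chained-assignment warning.

```python
import pandas as pd

original = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

alias = original               # second name for the same object
independent = original.copy()  # separate object with its own data

alias.loc[0, "a"] = 9878.0       # visible through `original` too
independent.loc[0, "a"] = 3.0    # does not touch `original`

print(alias is original, independent is original)
print(original.loc[0, "a"])
```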


In [85]:
newdf.loc[0,0] = 654

In [86]:
newdf.head(2)

Unnamed: 0,0,1,2,3,4
0,654.0,0.495731,0.482961,0.204074,0.61063
1,0.23607,0.097484,0.649971,0.878693,0.720416


In [87]:
newdf.columns = list("ABCDE") #renaming the columns

In [93]:
newdf.loc[1,'A'] = 456

In [94]:
newdf.head(2)

Unnamed: 0,A,B,C,D,E
0,654,0.495731,0.482961,0.204074,0.61063
1,456,0.097484,0.649971,0.878693,0.720416


In [95]:
newdf.loc[0,0] = 231 #the columns are now named A-E, so the label 0 does not exist and loc adds a new column named 0

In [96]:
newdf.head()

Unnamed: 0,A,B,C,D,E,0
0,654.0,0.495731,0.482961,0.204074,0.61063,231.0
1,456.0,0.097484,0.649971,0.878693,0.720416,
2,0.995941,0.500928,0.591229,0.929397,0.108425,
3,0.257371,0.003356,0.334284,0.498563,0.991031,
4,0.677884,0.367446,0.198339,0.844476,0.045933,
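
The extra column appears because `loc` works with labels and creates a missing column label on assignment. A minimal sketch of that behaviour next to position-based `iloc`, assuming a small illustrative frame:

```python
import pandas as pd

frame = pd.DataFrame({"A": [0.1, 0.2], "B": [0.3, 0.4]})

# Label 0 is not a column here, so .loc enlarges the frame with a new column
frame.loc[0, 0] = 231
print(list(frame.columns))

# Position-based access with .iloc targets an existing cell (first row, first column)
frame.iloc[0, 0] = 654
print(frame.loc[0, "A"])
```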


In [99]:
newdf = newdf.drop(0, axis=1) #deletes the column 0

In [100]:
newdf.head()

Unnamed: 0,A,B,C,D,E
0,654.0,0.495731,0.482961,0.204074,0.61063
1,456.0,0.097484,0.649971,0.878693,0.720416
2,0.995941,0.500928,0.591229,0.929397,0.108425
3,0.257371,0.003356,0.334284,0.498563,0.991031
4,0.677884,0.367446,0.198339,0.844476,0.045933


In [114]:
newdf.loc[[1,2,3,4],['C','D']] #to extract a sub-matrix from the whole matrix

Unnamed: 0,C,D
1,0.649971,0.878693
2,0.591229,0.929397
3,0.334284,0.498563
4,0.198339,0.844476


In [118]:
newdf.loc[:,['C','D']] #to extract particular columns

Unnamed: 0,C,D
0,0.482961,0.204074
1,0.649971,0.878693
2,0.591229,0.929397
3,0.334284,0.498563
4,0.198339,0.844476
...,...,...
329,0.564194,0.791043
330,0.760772,0.087167
331,0.592493,0.045817
332,0.669735,0.542396


In [119]:
newdf.loc[[1,2],:] #to extract particular rows

Unnamed: 0,A,B,C,D,E
1,456.0,0.097484,0.649971,0.878693,0.720416
2,0.995941,0.500928,0.591229,0.929397,0.108425


In [122]:
newdf.loc[(newdf['A']<0.3)] #returns all the rows in which column A has a value less than 0.3

Unnamed: 0,A,B,C,D,E
3,0.257371,0.003356,0.334284,0.498563,0.991031
6,0.022784,0.866225,0.381129,0.464485,0.912511
7,0.278741,0.995092,0.081947,0.254339,0.516702
10,0.200219,0.682353,0.789585,0.795114,0.018611
13,0.113585,0.505269,0.882409,0.851190,0.460082
...,...,...,...,...,...
324,0.19222,0.700568,0.300230,0.122843,0.874903
329,0.260389,0.995374,0.564194,0.791043,0.046968
331,0.168753,0.748499,0.592493,0.045817,0.096805
332,0.295851,0.313419,0.669735,0.542396,0.722473


In [126]:
newdf.loc[(newdf['A']<0.3) & (newdf['C']>0.5)] #returns all the rows which satisfy both conditions

Unnamed: 0,A,B,C,D,E
10,0.200219,0.682353,0.789585,0.795114,0.018611
13,0.113585,0.505269,0.882409,0.85119,0.460082
22,0.281538,0.918968,0.942341,0.519497,0.338011
23,0.287483,0.25624,0.780501,0.641104,0.690042
30,0.282863,0.298383,0.926759,0.100113,0.15525
54,0.173012,0.687491,0.930427,0.469327,0.095156
55,0.041402,0.881577,0.684132,0.708744,0.117423
70,0.156315,0.363632,0.891182,0.231934,0.87385
78,0.131797,0.473169,0.966107,0.783271,0.232197
79,0.002152,0.311262,0.955216,0.812591,0.918996
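
A short sketch of combining conditions. Each comparison needs its own parentheses (or its own variable, as here) because `&`, `|` and `~` bind more tightly than `<` and `>`. The frame and the names `low_a` and `high_c` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"A": [0.1, 0.5, 0.2, 0.9], "C": [0.7, 0.8, 0.3, 0.6]})

low_a = frame["A"] < 0.3
high_c = frame["C"] > 0.5

both = frame.loc[low_a & high_c]     # AND
either = frame.loc[low_a | high_c]   # OR
not_low = frame.loc[~low_a]          # NOT

print(list(both.index), list(either.index), list(not_low.index))
```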


-----------------------------------------------------------------------------
With the 'loc' indexer, you refer to a cell or a set of cells by the labels (names) of the rows and columns.

Similarly, with the 'iloc' indexer, you refer to a cell or a set of cells by their integer positions.

-----------------------------------------------------------------------------
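
The difference is easiest to see when the row labels are not integers. A minimal sketch, assuming an illustrative frame with string labels:

```python
import pandas as pd

frame = pd.DataFrame(
    {"A": [10, 20, 30], "B": [40, 50, 60]},
    index=["x", "y", "z"],
)

by_label = frame.loc["y", "B"]      # row labelled 'y', column labelled 'B'
by_position = frame.iloc[1, 1]      # second row, second column

# Slices differ: .loc includes the end label, .iloc excludes the end position
label_slice = frame.loc["x":"y"]    # rows x and y
position_slice = frame.iloc[0:1]    # row x only

print(by_label, by_position, len(label_slice), len(position_slice))
```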


In [129]:
newdf.iloc[0,4]

0.6106296287538958

In [131]:
newdf.iloc[[0,5],[1,2]] #first and sixth rows of the second and third columns

Unnamed: 0,B,C
0,0.495731,0.482961
5,0.418772,0.937255


In [200]:
newdf.index[1], newdf.index[3] # returns the labels of the rows at positions 1 and 3 (the 2nd and 4th rows)

(1, 3)

In [127]:
newdf.head()

Unnamed: 0,A,B,C,D,E
0,654.0,0.495731,0.482961,0.204074,0.61063
1,456.0,0.097484,0.649971,0.878693,0.720416
2,0.995941,0.500928,0.591229,0.929397,0.108425
3,0.257371,0.003356,0.334284,0.498563,0.991031
4,0.677884,0.367446,0.198339,0.844476,0.045933


In [133]:
newdf.drop([0]) #returns a copy without the row labelled 0; newdf itself is unchanged

Unnamed: 0,A,B,C,D,E
1,456,0.097484,0.649971,0.878693,0.720416
2,0.995941,0.500928,0.591229,0.929397,0.108425
3,0.257371,0.003356,0.334284,0.498563,0.991031
4,0.677884,0.367446,0.198339,0.844476,0.045933
5,0.572393,0.418772,0.937255,0.595417,0.091175
...,...,...,...,...,...
329,0.260389,0.995374,0.564194,0.791043,0.046968
330,0.698648,0.463310,0.760772,0.087167,0.320735
331,0.168753,0.748499,0.592493,0.045817,0.096805
332,0.295851,0.313419,0.669735,0.542396,0.722473


In [136]:
newdf.drop(['A', 'C'], axis=1) #returns a copy without the listed columns; axis=1 means the labels refer to columns

Unnamed: 0,B,D,E
0,0.495731,0.204074,0.610630
1,0.097484,0.878693,0.720416
2,0.500928,0.929397,0.108425
3,0.003356,0.498563,0.991031
4,0.367446,0.844476,0.045933
...,...,...,...
329,0.995374,0.791043,0.046968
330,0.463310,0.087167,0.320735
331,0.748499,0.045817,0.096805
332,0.313419,0.542396,0.722473


In [145]:
newdf.drop(['B','D'], axis=1, inplace=True) #inplace=True removes the columns from newdf itself, so there is no need to write newdf=newdf.drop(...); the KeyError below appears when the columns are already gone, e.g. when the cell is run a second time

KeyError: "['B', 'D'] not found in axis"
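
A hedged sketch of how a repeated drop can be made safe. It uses the `errors='ignore'` option of `DataFrame.drop` on a small illustrative frame; this is one possible approach, not what the notebook itself does.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2], "B": [3, 4], "D": [5, 6]})

frame.drop(["B", "D"], axis=1, inplace=True)   # modifies frame, returns None

# Running the same drop again would raise KeyError;
# errors='ignore' skips labels that are no longer present
frame.drop(["B", "D"], axis=1, inplace=True, errors="ignore")
print(list(frame.columns))
```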

In [147]:
newdf.head(3)

Unnamed: 0,A,C,E
0,654.0,0.482961,0.61063
1,456.0,0.649971,0.720416
2,0.995941,0.591229,0.108425


In [149]:
newdf.reset_index() #resets the index numbering and keeps the old index as a new column named 'index'

Unnamed: 0,index,A,C,E
0,0,654,0.482961,0.610630
1,1,456,0.649971,0.720416
2,2,0.995941,0.591229,0.108425
3,3,0.257371,0.334284,0.991031
4,4,0.677884,0.198339,0.045933
...,...,...,...,...
329,329,0.260389,0.564194,0.046968
330,330,0.698648,0.760772,0.320735
331,331,0.168753,0.592493,0.096805
332,332,0.295851,0.669735,0.722473


In [152]:
newdf.reset_index(drop=True, inplace=True) #drop=True prevents reset_index from creating a new column for the old index

In [153]:
newdf.head()

Unnamed: 0,A,C,E
0,654.0,0.482961,0.61063
1,456.0,0.649971,0.720416
2,0.995941,0.591229,0.108425
3,0.257371,0.334284,0.991031
4,0.677884,0.198339,0.045933
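
`reset_index` matters most after rows have been filtered out, when the remaining labels have gaps. A minimal sketch, assuming a small made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({"A": [5, 1, 7, 2]})
filtered = frame[frame["A"] > 1]    # keeps labels 0, 2, 3

with_old = filtered.reset_index()             # old labels saved in an 'index' column
renumbered = filtered.reset_index(drop=True)  # old labels discarded

print(list(filtered.index), list(with_old.columns), list(renumbered.index))
```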


_____________________________________________________________________________
One of the important features of the Pandas library is its ability to perform operations on cells containing NaN, NULL or infinity values.
_____________________________________________________________________________

In [169]:
newdf.loc[:,['A']] = None #sets every value in column 'A' to a missing value (NULL)

In [167]:
newdf

Unnamed: 0,A,C,E
0,,0.482961,0.610630
1,,0.649971,0.720416
2,,0.591229,0.108425
3,,0.334284,0.991031
4,,0.198339,0.045933
...,...,...,...
329,,0.564194,0.046968
330,,0.760772,0.320735
331,,0.592493,0.096805
332,,0.669735,0.722473


In [195]:
newdf.isnull().sum() #displays total number of NaN entries in each column

A    334
C      0
E      0
dtype: int64

In [168]:
newdf['A'].isnull() #returns a boolean Series that is True at every index where column 'A' has a NULL value

0      True
1      True
2      True
3      True
4      True
       ... 
329    True
330    True
331    True
332    True
333    True
Name: A, Length: 334, dtype: bool

In [171]:
dfn = pd.DataFrame({"name": ['Sateesh', 'Sambhav', 'Drake'], "car": [np.nan, 'Porche', 'Lamborgini'], "born": [pd.NaT, pd.Timestamp("1949-09-29"), pd.NaT]})

In [175]:
dfn.head()

Unnamed: 0,name,car,born
0,Sateesh,,NaT
1,Sambhav,Porche,1949-09-29
2,Drake,Lamborgini,NaT


In [177]:
dfn.dropna() #drops all the rows containing NaN

Unnamed: 0,name,car,born
1,Sambhav,Porche,1949-09-29


In [178]:
dfn.dropna(how='all', axis=1) #remove a column only if all the values in that column are NaN

Unnamed: 0,name,car,born
0,Sateesh,,NaT
1,Sambhav,Porche,1949-09-29
2,Drake,Lamborgini,NaT
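
Besides `how` and `axis`, `dropna` also accepts `subset` and `thresh`. A short sketch on a frame shaped like `dfn` (the name `people` and the exact values are illustrative):

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    "name": ["Sateesh", "Sambhav", "Drake"],
    "car": [np.nan, "Porsche", "Lamborghini"],
    "born": [pd.NaT, pd.Timestamp("1949-09-29"), pd.NaT],
})

any_missing = people.dropna()               # default how='any': drop rows with any NaN/NaT
car_known = people.dropna(subset=["car"])   # only look at the 'car' column
at_least_two = people.dropna(thresh=2)      # keep rows with at least 2 non-missing values

print(list(any_missing.index), list(car_known.index), list(at_least_two.index))
```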


In [181]:
dfn.drop_duplicates(subset=['name'], keep='first') #drops rows whose 'name' value is duplicated, keeping the first occurrence (dfn has no duplicate names, so nothing is dropped)

Unnamed: 0,name,car,born
0,Sateesh,,NaT
1,Sambhav,Porche,1949-09-29
2,Drake,Lamborgini,NaT


In [182]:
dfn.drop_duplicates(subset=['name'], keep='last') #drops rows whose 'name' value is duplicated, keeping the last occurrence

Unnamed: 0,name,car,born
0,Sateesh,,NaT
1,Sambhav,Porche,1949-09-29
2,Drake,Lamborgini,NaT


In [183]:
dfn.drop_duplicates(subset=['name'], keep=False) #drops every row whose 'name' value is duplicated, keeping none of them

Unnamed: 0,name,car,born
0,Sateesh,,NaT
1,Sambhav,Porche,1949-09-29
2,Drake,Lamborgini,NaT
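
Since `dfn` has no repeated names, the three calls above return the same rows. A sketch with an actual duplicate makes the `keep` options visible; the frame `people` and its values are made up for this purpose.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Drake", "Sambhav", "Drake"],
    "car": ["Audi", "Porsche", "Lamborghini"],
})

first = people.drop_duplicates(subset=["name"], keep="first")       # keeps rows 0 and 1
last = people.drop_duplicates(subset=["name"], keep="last")         # keeps rows 1 and 2
unique_only = people.drop_duplicates(subset=["name"], keep=False)   # keeps row 1 only

print(list(first.index), list(last.index), list(unique_only.index))
```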


In [185]:
dfn.info() #displays information about the dataframe

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   name    3 non-null      object        
 1   car     2 non-null      object        
 2   born    1 non-null      datetime64[ns]
dtypes: datetime64[ns](1), object(2)
memory usage: 200.0+ bytes


In [187]:
dfn['name'].value_counts(dropna=False) #displays the number of occurrences of each distinct value in the series; dropna=False means that NaN values are also counted

Sateesh    1
Sambhav    1
Drake      1
Name: name, dtype: int64
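
The 'name' column has no missing values, so `dropna=False` makes no visible difference above. A small sketch with a Series that does contain NaN (the name `cars` and its values are illustrative):

```python
import numpy as np
import pandas as pd

cars = pd.Series(["Audi", np.nan, "Audi", "Porsche", np.nan])

without_nan = cars.value_counts()              # NaN ignored by default
with_nan = cars.value_counts(dropna=False)     # NaN counted as its own entry

print(without_nan.sum(), with_nan.sum())
```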

In [188]:
dfn.notnull() #returns a boolean dataframe: True where a cell is not NaN, False where it is

Unnamed: 0,name,car,born
0,True,False,False
1,True,True,True
2,True,True,False


In [190]:
dfn.isnull() #returns a boolean dataframe: True where a cell is NaN, False where it is not

Unnamed: 0,name,car,born
0,False,True,True
1,False,False,False
2,False,False,True


In [193]:
data = pd.read_excel('name.xlsx', sheet_name='Sheetx') #imports the sheet 'Sheetx' from the Excel file name.xlsx so you can work on it. Recent pandas versions read .xlsx files with openpyxl (xlrd handles only legacy .xls files), so install it before importing an Excel sheet

FileNotFoundError: [Errno 2] No such file or directory: 'name.xlsx'

In [194]:
data.to_excel('name.xlsx', sheet_name='Sheetx') #saves data to an Excel file; install openpyxl before writing an Excel sheet

Problem:
-------------
* Create a dataframe which contains only integers with 3 rows and 2 columns.
* Perform the following functions on the dataframe:
_____________________________________________________________________________
1. df.describe()
2. df.mean()
3. df.corr()
4. df.count()
5. df.max()
6. df.min()
7. df.median()
8. df.std()
_____________________________________________________________________________

In [201]:
arr = {'C1':[1,2,3], 'C2':[4,5,6]}

In [202]:
pdf = pd.DataFrame(arr)

In [203]:
pdf

Unnamed: 0,C1,C2
0,1,4
1,2,5
2,3,6


In [204]:
pdf.describe()

Unnamed: 0,C1,C2
count,3.0,3.0
mean,2.0,5.0
std,1.0,1.0
min,1.0,4.0
25%,1.5,4.5
50%,2.0,5.0
75%,2.5,5.5
max,3.0,6.0


In [205]:
pdf.mean() #calculates the mean (average) value of each column

C1    2.0
C2    5.0
dtype: float64

In [206]:
pdf.corr() #used to compute the pairwise correlation between columns of a DataFrame. It calculates the correlation coefficient, which is a measure of the linear relationship between two variables

Unnamed: 0,C1,C2
C1,1.0,1.0
C2,1.0,1.0


In [208]:
pdf.count() #counts the non-null entries in each column

C1    3
C2    3
dtype: int64

In [210]:
pdf.max() #finds maximum value in each column

C1    3
C2    6
dtype: int64

In [214]:
pdf.min() #finds minimum value in each column

C1    1
C2    4
dtype: int64

In [212]:
pdf.median() #finds the median of the values in each column

C1    2.0
C2    5.0
dtype: float64

In [213]:
pdf.std() #finds the standard deviation of the values in each column

C1    1.0
C2    1.0
dtype: float64
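
The value 1.0 above is the sample standard deviation: pandas divides by n - 1 by default (`ddof=1`), while NumPy's `np.std` divides by n. A minimal sketch of the difference on the same numbers:

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({"C1": [1, 2, 3], "C2": [4, 5, 6]})

sample_std = pdf["C1"].std()                   # pandas default: ddof=1 (divide by n - 1)
population_std = pdf["C1"].std(ddof=0)         # divide by n
numpy_default = np.std(pdf["C1"].to_numpy())   # NumPy default: ddof=0

print(sample_std, population_std, numpy_default)
```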