<h1>Python Style Guide/Naming Convention</h1>
<p><b>Use the following table</b></p>

<table style="width:100%">
  <tr>
    <th>Action</th>
    <th>Naming Convention</th> 
    <th>Description</th>
  </tr>
  <tr>
    <td>Package/Module Name</td>
    <td>lowercasename</td>
    <td>Preferably an all-lowercase name. Underscores may be used to separate words.</td>
  </tr>
  <tr>
    <td>Class Name</td>
    <td>CapWords</td>
    <td>Class names normally follow the CapWords convention. Exceptions exist.</td>
  </tr>
  <tr>
    <td>Functions</td>
    <td>lowercase_underscore_separated</td>
    <td>Function names should be lowercase, with words separated by underscores as necessary to improve readability</td>
  </tr>
  <tr>
    <td>Variables</td>
    <td>lowercase_underscore_separated</td>
    <td>Variable names should be lowercase, with words separated by underscores as necessary to improve readability</td>
  </tr>
  <tr>
    <td>Function and Method Argument</td>
    <td>Use "self" for the first argument to an instance method. Use "cls" for the first argument to a class method</td>
    <td>If a function argument's name clashes with a reserved keyword, it is generally better to append a single trailing underscore rather than use an abbreviation or spelling corruption. Thus class_ is better than clss. </td>
  </tr>
  <tr>
    <td>Method Name and Instance variable</td>
    <td>lowercase_underscore_separated</td>
    <td>Exceptions may apply</td>
  </tr>
  <tr>
    <td>Constant</td>
    <td>ALL_CAPS_UNDERSCORE_SEPARATED</td>
    <td>Constants are usually defined on a module level and written in all capital letters with underscores separating words.</td>
  </tr>
</table>
<p><b>For more on the style guide and naming conventions, see the link below</b></p>
<p><a href="https://www.python.org/dev/peps/pep-0008/#package-and-module-names">PEP 8--Style Guide for Python</a></p>
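
The conventions in the table can be seen side by side in a small module. This is only an illustrative sketch: every name in it (`MAX_RETRY_COUNT`, `ReportBuilder`, `count_words`, and so on) is hypothetical and chosen purely to show the casing rules.

```python
MAX_RETRY_COUNT = 3  # constant: all capitals with underscores, defined at module level


class ReportBuilder:  # class name: CapWords
    def __init__(self, report_title):  # instance method: first argument is self
        self.report_title = report_title  # instance variable: lowercase with underscores

    @classmethod
    def from_default(cls):  # class method: first argument is cls
        return cls(report_title="untitled")

    def build_summary(self, class_="basic"):  # trailing underscore avoids the keyword `class`
        return f"{self.report_title} ({class_})"


def count_words(input_text):  # function and argument: lowercase with underscores
    word_count = len(input_text.split())  # variable: lowercase with underscores
    return word_count


print(ReportBuilder.from_default().build_summary())
print(count_words("naming things is hard"))
```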

</body>
</html>

<h1>Reading CSV and Excel files from different sources</h1>

### Local file system

In [13]:
import pandas as pd
path = r'C:\Users\username\.....\filename.csv'
df = pd.read_csv(path)

FileNotFoundError: [Errno 2] File b'C:\\Users\\username\\.....\\filename.csv' does not exist: b'C:\\Users\\username\\.....\\filename.csv'
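
The `FileNotFoundError` above is what `pd.read_csv` raises when the path does not point to an existing file, as is the case for the placeholder path. One way to handle this is to check the path before reading. The sketch below assumes a hypothetical helper named `read_csv_if_exists` and uses a temporary file so that it does not depend on any real location.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_csv_if_exists(csv_path):
    """Return a DataFrame, or None when the file is missing (hypothetical helper)."""
    csv_path = Path(csv_path)
    if not csv_path.is_file():
        print(f"File not found: {csv_path}")
        return None
    return pd.read_csv(csv_path)


# Use a temporary folder so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.csv"
    sample_path.write_text("id,name\n1,John\n2,Jacob\n")
    df_found = read_csv_if_exists(sample_path)
    df_missing = read_csv_if_exists(Path(tmp_dir) / "missing.csv")

print(df_found)
```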

### From GitHub

In [None]:
import pandas as pd
github_url = 'https://raw.githubusercontent.com/username/...../filename.csv'
df = pd.read_csv(github_url)

### From Azure data lake store file system

In [None]:
from azure.datalake.store import core, lib, multithread
import pandas as pd
import os

directory_id = 'xxx-xx-xxxx-xx-xxxx'
application_key = 'xxxccxcs@#dsgfxxx'
application_id = 'xxxx-xxxx-xx-xx-xxxxx'

adls_cred = lib.auth(tenant_id = directory_id, client_secret= application_key, client_id= application_id)
adls_name = "storename"
adls_client = core.AzureDLFileSystem(adls_cred, store_name= adls_name)

f = adls_client.open('.../../.../filename.csv', 'rb') # Path similar to local directory
df = pd.read_csv(f)


<p><b>For more on reading CSV and Excel files, see the following documentation</b></p>
<p><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html">pandas.read_csv</a></p>
<p><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html">pandas.read_excel</a></p>


<h1>Creating DataFrame</h1>

### From dictionary

In [None]:
import pandas as pd
sample_dict = {'id_col':[1,2,3,4], 'val_col': ['John', 'Jacob', 'Jingleheimer', 'Schmidt']}
df_from_dict = pd.DataFrame(data= sample_dict) 
df_from_dict

   id_col       val_col
0       1          John
1       2         Jacob
2       3  Jingleheimer
3       4       Schmidt


### From numpy array

In [None]:
import pandas as pd
import numpy as np
df_from_numpy = pd.DataFrame(np.array([[1, 12, 'John'], [4, 13, 'Jacob'], [7, 10, 'Jingleheimer']]),
                             columns=['id', 'age', 'name'])
df_from_numpy

  id age          name
0  1  12          John
1  4  13         Jacob
2  7  10  Jingleheimer
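
One detail worth noting about the NumPy example: a NumPy array holds a single dtype, so mixing numbers and text turns every value, including `id` and `age`, into a string. A minimal sketch of converting the numeric columns back, assuming the same sample data (the name `df_typed` is introduced here for illustration):

```python
import numpy as np
import pandas as pd

df_from_numpy = pd.DataFrame(
    np.array([[1, 12, 'John'], [4, 13, 'Jacob'], [7, 10, 'Jingleheimer']]),
    columns=['id', 'age', 'name'],
)
print(df_from_numpy.dtypes)  # no column is numeric at this point

# Convert the columns that should be numeric explicitly.
df_typed = df_from_numpy.astype({'id': 'int64', 'age': 'int64'})
print(df_typed.dtypes)
print(df_typed['age'].sum())  # arithmetic now works on age
```

Building the DataFrame from a dictionary, as in the previous example, avoids the problem because each column keeps its own dtype.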


<h1>Drop column by Column Name</h1>
<p><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html?highlight=drop#pandas.DataFrame.drop">pandas.DataFrame.drop</a></p>

In [None]:
import pandas as pd
import numpy as np
sample_dict = {'id':np.arange(start = 1, stop = 5), 
               'name': ['John', 'Jacob', 'Jingleheimer', 'Schmidt'],
               'age':np.random.randint(low=0, high = 5, size = 4)}
df_from_dict = pd.DataFrame(data= sample_dict) 
df_col_dropped = df_from_dict.drop(['age'] , axis =1)

print("Before\n")
print(df_from_dict.head())

print("After Drop\n")
print(df_col_dropped.head())


Before

   id          name  age
0   1          John    4
1   2         Jacob    4
2   3  Jingleheimer    3
3   4       Schmidt    3
After Drop

   id          name
0   1          John
1   2         Jacob
2   3  Jingleheimer
3   4       Schmidt


<h1>Drop multiple columns by Name</h1>

In [None]:
import pandas as pd
import numpy as np
sample_dict = {'id':np.arange(start = 1, stop = 5), 
               'name': ['John', 'Jacob', 'Jingleheimer', 'Schmidt'],
               'age':np.random.randint(low=0, high = 5, size = 4),
              'height': np.random.randint(20,30,size=4)}
df_from_dict = pd.DataFrame(data= sample_dict) 

df_col_dropped = df_from_dict.drop(['age', 'height'] , axis =1)

print("Before\n")
print(df_from_dict.head())

print("\nAfter Drop\n")
print(df_col_dropped.head())

Before

   id          name  age  height
0   1          John    1      24
1   2         Jacob    2      25
2   3  Jingleheimer    4      25
3   4       Schmidt    1      20

After Drop

   id          name
0   1          John
1   2         Jacob
2   3  Jingleheimer
3   4       Schmidt


<h1>Filter DF by value(s) in column</h1>

In [None]:
import pandas as pd
import numpy as np
sample_dict = {'id':np.arange(start = 1, stop = 5), 
               'name': ['John', 'Jacob', 'Jingleheimer', 'Schmidt'],
               'age':np.random.randint(low=0, high = 5, size = 4),
              'height': np.random.randint(20,30,size=4)}
df = pd.DataFrame(data= sample_dict) 

val_list = ['John', 'Jacob']

filtered_df = df[df['name'].isin(val_list)]
print("Before\n")
print(df)

print('\nAfter\n')
print(filtered_df)

Before

   id          name  age  height
0   1          John    1      27
1   2         Jacob    1      24
2   3  Jingleheimer    1      24
3   4       Schmidt    3      23

After

   id   name  age  height
0   1   John    1      27
1   2  Jacob    1      24
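
The same boolean-mask idea extends to excluding values and to combining conditions. The sketch below uses fixed values copied from the "Before" output above instead of random ones; the variable names `not_in_list` and `short_and_young` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['John', 'Jacob', 'Jingleheimer', 'Schmidt'],
    'age': [1, 1, 1, 3],
    'height': [27, 24, 24, 23],
})

# Rows whose name is NOT in the list: negate the mask with ~.
not_in_list = df[~df['name'].isin(['John', 'Jacob'])]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
short_and_young = df[(df['age'] < 3) & (df['height'] < 25)]

print(not_in_list)
print(short_and_young)
```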


<h1>File System Operation in Azure Data Lake Store</h1>

In [None]:
from azure.datalake.store import core, lib, multithread
import pandas as pd
import os

directory_id = 'xxx-xx-xxxx-xx-xxxx'
application_key = 'xxxccxcs@#dsgfxxx'
application_id = 'xxxx-xxxx-xx-xx-xxxxx'

adls_cred = lib.auth(tenant_id = directory_id, client_secret= application_key, client_id= application_id)
adls_name = "storename"
adls_client = core.AzureDLFileSystem(adls_cred, store_name= adls_name)


### Create a directory

In [None]:
adls_client.mkdir('/newfolder')

### Upload a file in Azure Data Lake Store

In [None]:
multithread.ADLUploader(adls_client, lpath="C:\\.....\\file.csv", rpath="/newfolder/file.csv", nthreads=64, overwrite=True, buffersize=4194304, blocksize=4194304)

### Download a file from Azure Data Lake Store

In [None]:
multithread.ADLDownloader(adls_client, lpath='C:\\user.....\\mysamplefile.txt.out', rpath='/newfolder/mysamplefile.txt', nthreads=64, overwrite=True, buffersize=4194304, blocksize=4194304)

### Delete a directory

In [None]:
adls_client.rm('/newfolder', recursive=True)

<h1>Convert DataFrame to csv</h1>
<p><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html">pandas.DataFrame.to_csv</a></p>

In [None]:
import pandas as pd
import numpy as np
sample_dict = {'id':np.arange(start = 1, stop = 5), 
               'name': ['John', 'Jacob', 'Jingleheimer', 'Schmidt'],
               'age':np.random.randint(low=0, high = 5, size = 4),
              'height': np.random.randint(20,30,size=4)}
df = pd.DataFrame(data= sample_dict) 

output_path = r'C:\User\...\...\filename.csv'
df.to_csv(output_path, index= False) 
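
To see what `index=False` changes without writing to disk, `to_csv` can be called without a path, in which case it returns the CSV text. This is a small sketch with made-up data; it also shows that reading back a file saved with the index produces an extra `Unnamed: 0` column.

```python
import io

import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'name': ['John', 'Jacob']})

csv_with_index = df.to_csv()                # no path given: the CSV text is returned
csv_without_index = df.to_csv(index=False)  # the row index is left out

print(csv_with_index)
print(csv_without_index)

# Reading the text back in shows the difference.
df_with_extra_column = pd.read_csv(io.StringIO(csv_with_index))
df_round_trip = pd.read_csv(io.StringIO(csv_without_index))
print(df_with_extra_column.columns.tolist())
print(df_round_trip.columns.tolist())
```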


<h1>Combining multiple DataFrames </h1>
<p><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html?highlight=concat#pandas.concat">pandas.concat</a></p>

### Appending rows between two or more DF

In [None]:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.array([[1,'john'], [2,'jacob']]),
                  columns=['id','name'])
df2 = pd.DataFrame(np.array([[3,'Jingleheimer'], [4,'Schmidt']]),
                  columns=['id','name'])
combined_df = pd.concat([df1, df2], ignore_index=True)

print('\ndf1\n',df1)
print('\ndf2\n',df2)

print('\nAfter concatenating two dataframes\n',combined_df)


df1
   id   name
0  1   john
1  2  jacob

df2
   id          name
0  3  Jingleheimer
1  4       Schmidt

After concatenating two dataframes
   id          name
0  1          john
1  2         jacob
2  3  Jingleheimer
3  4       Schmidt


### Combining DataFrames when only some attributes match

In [None]:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.array([[1,'john',5], [2,'jacob',2]]),
                  columns=['id','name','age'])
df2 = pd.DataFrame(np.array([[3,'Jingleheimer'], [4,'Schmidt']]),
                  columns=['id','name'])

combined_df = pd.concat([df1, df2], sort=False)
print('\ndf1\n',df1)
print('\ndf2\n',df2)
print('\nCompare two dataframes. df1 has 3 attributes. df2 is missing age')

print('\nAfter concatenating two dataframes\n',combined_df)
print('\nThe output will add NaN for the age attribute')


df1
   id   name age
0  1   john   5
1  2  jacob   2

df2
   id          name
0  3  Jingleheimer
1  4       Schmidt

Compare two dataframes. df1 has 3 attributes. df2 is missing age

After concatenating two dataframes
   id          name  age
0  1          john    5
1  2         jacob    2
0  3  Jingleheimer  NaN
1  4       Schmidt  NaN

The output will add NaN for the age attribute


### Combining DataFrames and keeping only the attributes shared by both

In [None]:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.array([[1,'john',5], [2,'jacob',2]]),
                  columns=['id','name','age'])
df2 = pd.DataFrame(np.array([[3,'Jingleheimer'], [4,'Schmidt']]),
                  columns=['id','name'])

combined_df = pd.concat([df1, df2], join= 'inner')
print('\ndf1\n',df1)
print('\ndf2\n',df2)
print('\nCompare two dataframes. df1 has 3 attributes. df2 is missing age')

print('\nAfter concatenating two dataframes\n',combined_df)
print('\nThe output ignores the age attribute as the second dataframe is missing that attribute')


df1
   id   name age
0  1   john   5
1  2  jacob   2

df2
   id          name
0  3  Jingleheimer
1  4       Schmidt

Compare two dataframes. df1 has 3 attributes. df2 is missing age

After concatenating two dataframes
   id          name
0  1          john
1  2         jacob
0  3  Jingleheimer
1  4       Schmidt

The output ignores the age attribute as the second dataframe is missing that attribute


### Combining Dataframes horizontally

In [None]:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.array([[1,'john'], [2,'jacob'],[3,'Jingleheimer'], [4,'Schmidt']]),
                  columns=['id','name'])
df2 = pd.DataFrame(np.array([[5,'male'],[4,'male'],[3,'male'],[6,'female']]),
                  columns=['age','gender'])

combined_df = pd.concat([df1, df2], axis=1)
print('\ndf1\n',df1)
print('\ndf2\n',df2)

print('\nAfter concatenating two dataframes\n',combined_df)
print('\nBased on the concatenation, the new attributes from df2 are added horizontally')


df1
   id          name
0  1          john
1  2         jacob
2  3  Jingleheimer
3  4       Schmidt

df2
   age  gender
0   5    male
1   4    male
2   3    male
3   6  female

After concatenating two dataframes
   id          name age  gender
0  1          john   5    male
1  2         jacob   4    male
2  3  Jingleheimer   3    male
3  4       Schmidt   6  female

Based on the concatenation, the new attributes from df2 are added horizontally


<h1>Set Index from column</h1>


In [None]:
import pandas as pd
import numpy as np
sample_dict = {'id':np.arange(start = 100, stop = 104, step = 1), 
               'name': ['John', 'Jacob', 'Jingleheimer', 'Schmidt'],
               'age':np.random.randint(low=0, high = 5, size = 4),
              'height': np.random.randint(20,30,size=4)}
df = pd.DataFrame(data= sample_dict) 
print('\nBefore\n',df)
print('\nAfter\n',df.set_index('id'))


Before
     id          name  age  height
0  100          John    2      20
1  101         Jacob    3      25
2  102  Jingleheimer    4      23
3  103       Schmidt    1      24

After
              name  age  height
id                            
100          John    2      20
101         Jacob    3      25
102  Jingleheimer    4      23
103       Schmidt    1      24
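
Two points follow from the example above: `set_index` returns a new DataFrame and leaves the original unchanged, and once a column is the index it can be used for label-based lookups. A brief sketch, with `df_indexed` and `df_restored` as illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [100, 101, 102, 103],
    'name': ['John', 'Jacob', 'Jingleheimer', 'Schmidt'],
})

df_indexed = df.set_index('id')         # new DataFrame; df still has id as a column
print(df_indexed.loc[102, 'name'])      # look up a row by its id label
df_restored = df_indexed.reset_index()  # move the index back into an ordinary column
print(df_restored)
```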


<h1>Joining DataFrames</h1>

In [None]:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.array([[1,'john'], [2,'jacob'],[3,'Jingleheimer'], [4,'Schmidt']]),
                  columns=['id','name'])
df2 = pd.DataFrame(np.array([[1, 5,'male'],[2, 4,'male'],[3, 3,'male'],[4, 6,'female']]),
                  columns=['id','age','gender'])

combined_df = df1.join(df2, lsuffix='_left', rsuffix='_right')
print('\ndf1\n',df1)
print('\ndf2\n',df2)

print('\nAfter joining two dataframes\n',combined_df)
print('\nBased on the join operation, new attributes from df2 are added horizontally. However, we have duplicate ids')



df1
   id          name
0  1          john
1  2         jacob
2  3  Jingleheimer
3  4       Schmidt

df2
   id age  gender
0  1   5    male
1  2   4    male
2  3   3    male
3  4   6  female

After joining two dataframes
   id_left          name id_right age  gender
0       1          john        1   5    male
1       2         jacob        2   4    male
2       3  Jingleheimer        3   3    male
3       4       Schmidt        4   6  female

Based on the join operation, new attributes from df2 are added horizontally. However, we have duplicate ids


In [None]:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.array([[1,'john'], [2,'jacob'],[3,'Jingleheimer'], [4,'Schmidt']]),
                  columns=['id','name'])
df2 = pd.DataFrame(np.array([[1, 5,'male'],[2, 4,'male'],[3, 3,'male'],[4, 6,'female']]),
                  columns=['id','age','gender'])

combined_df = df1.join(df2.set_index('id'), on ='id')
print('\ndf1\n',df1)
print('\ndf2\n',df2)

print('\nAfter joining two dataframes\n',combined_df)
print('\nBased on the join, the new attributes from df2 are added horizontally. We also no longer have to worry about duplicate ids')



df1
   id          name
0  1          john
1  2         jacob
2  3  Jingleheimer
3  4       Schmidt

df2
   id age  gender
0  1   5    male
1  2   4    male
2  3   3    male
3  4   6  female

After joining two dataframes
   id          name age  gender
0  1          john   5    male
1  2         jacob   4    male
2  3  Jingleheimer   3    male
3  4       Schmidt   6  female

Based on the join, the new attributes from df2 are added horizontally. We also no longer have to worry about duplicate ids
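
An alternative to `join` with `set_index` is `pd.merge`, which matches rows on a named column directly and lets the `how` argument control which ids are kept. The sketch below uses slightly different sample data (id 5 instead of 4 in `df2`) so that the inner and left joins give different results; that change is an assumption made for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'name': ['john', 'jacob', 'Jingleheimer', 'Schmidt']})
df2 = pd.DataFrame({'id': [1, 2, 3, 5],
                    'age': [5, 4, 3, 6]})

inner_df = pd.merge(df1, df2, on='id', how='inner')  # only ids present in both frames
left_df = pd.merge(df1, df2, on='id', how='left')    # every row of df1; unmatched ages are NaN

print(inner_df)
print(left_df)
```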


<h1>Converting JSON to DataFrame</h1>
<p><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html">pandas.read_json</a></p>
<p><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html">pandas.json_normalize</a></p>

In [None]:
from pandas import json_normalize  # pandas >= 1.0; older versions use: from pandas.io.json import json_normalize
data = [{'Country': 'Nepal',
          'ShortName': 'NP',
          'OtherInfo': {
               'GovType': 'Democracy',
              'HeadState': 'Prime Minister'
          },
          'District/State/Province': [{'name': 'Kathmandu', 'population': 12345},
                      {'name': 'Morang', 'population': 40000},
                      {'name': 'Sunsari', 'population': 60000}]},
         {'Country': 'India',
          'ShortName': 'IN',
          'OtherInfo': {
               'GovType': 'Democracy',
              'HeadState': 'President'
          },
          'District/State/Province': [{'name': 'Bihar', 'population': 1234},
                       {'name': 'UP', 'population': 1337}]}]



df = json_normalize(data, 'District/State/Province', ['Country', 'ShortName', 
                                                      ['OtherInfo', 'GovType'],
                                                     ['OtherInfo', 'HeadState']])
df

        name  population Country ShortName OtherInfo.GovType OtherInfo.HeadState
0  Kathmandu       12345   Nepal        NP         Democracy      Prime Minister
1     Morang       40000   Nepal        NP         Democracy      Prime Minister
2    Sunsari       60000   Nepal        NP         Democracy      Prime Minister
3      Bihar        1234   India        IN         Democracy           President
4         UP        1337   India        IN         Democracy           President


<h1>Downloading and extracting a .tgz (TAR) file</h1>
<p><a href="https://docs.python.org/2/library/tarfile.html">tarfile</a></p>

In [None]:
import os
import tarfile
from six.moves import urllib

DOWNLOAD_ROOT = "https://www.website.com/..."
FILE_PATH = os.path.join("dataset","folder")
FILE_URL = DOWNLOAD_ROOT + "datasets/folder/file.tgz"

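def fetch_data(file_url = FILE_URL, file_path = FILE_PATH):
    # Create the target folder if it does not exist yet
    if not os.path.isdir(file_path):
        os.makedirs(file_path)
    tgz_path = os.path.join(file_path, "file.tgz")
    # Download the archive from file_url and save it as tgz_path
    urllib.request.urlretrieve(file_url, tgz_path)
    # Extract the archive into the same folder; the with block closes the file
    with tarfile.open(tgz_path) as file_tgz:
        file_tgz.extractall(path = file_path)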

fetch_data()
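
Because `fetch_data` depends on a placeholder URL, the extraction step can be tried on its own with an archive built locally. The following sketch uses only the standard library; the file names (`data.csv`, `file.tgz`, `extracted`) are made up for the example.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as work_dir:
    # Build a small .tgz so the example runs without a network connection.
    source_path = os.path.join(work_dir, "data.csv")
    with open(source_path, "w") as source_file:
        source_file.write("id,name\n1,John\n")

    tgz_path = os.path.join(work_dir, "file.tgz")
    with tarfile.open(tgz_path, "w:gz") as archive:
        archive.add(source_path, arcname="data.csv")

    # List the members, then extract them into a separate folder.
    extract_dir = os.path.join(work_dir, "extracted")
    os.makedirs(extract_dir)
    with tarfile.open(tgz_path, "r:gz") as archive:
        member_names = archive.getnames()
        archive.extractall(path=extract_dir)

    extracted_files = os.listdir(extract_dir)

print(member_names)
print(extracted_files)
```

Listing the members with `getnames()` before extracting is a cheap way to check what an unfamiliar archive contains.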

<h1>Ordinal Encoding</h1>
<p><a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html">sklearn.preprocessing.OrdinalEncoder</a></p>

### Encoding single column

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from pandas import json_normalize  # pandas >= 1.0; older versions use: from pandas.io.json import json_normalize

data = [{'Country': 'Nepal',
          'ShortName': 'NP',
          'OtherInfo': {
               'GovType': 'Democracy',
              'HeadState': 'Prime Minister'
          },
          'District/State/Province': [{'name': 'Kathmandu', 'population': 12345},
                      {'name': 'Morang', 'population': 40000},
                      {'name': 'Sunsari', 'population': 60000}]},
         {'Country': 'India',
          'ShortName': 'IN',
          'OtherInfo': {
               'GovType': 'Democracy',
              'HeadState': 'President'
          },
          'District/State/Province': [{'name': 'Bihar', 'population': 1234},
                       {'name': 'UP', 'population': 1337}]}]



df = json_normalize(data, 'District/State/Province', ['Country', 'ShortName'])
print('Before')
print(df)

ordinal_encoder = OrdinalEncoder()
encoded_data = ordinal_encoder.fit_transform(df[['Country']])
print(encoded_data)

# To add the encoded value to the existing dataframe...
df_encoded = df.copy()
df_encoded['country_encoded'] = encoded_data
print('\nAfter ')
print(df_encoded)
print('\nView encoded categories: run ordinal_encoder.categories_\n')
print(ordinal_encoder.categories_)

Before
        name  population Country ShortName
0  Kathmandu       12345   Nepal        NP
1     Morang       40000   Nepal        NP
2    Sunsari       60000   Nepal        NP
3      Bihar        1234   India        IN
4         UP        1337   India        IN
[[1.]
 [1.]
 [1.]
 [0.]
 [0.]]

After 
        name  population Country ShortName  country_encoded
0  Kathmandu       12345   Nepal        NP              1.0
1     Morang       40000   Nepal        NP              1.0
2    Sunsari       60000   Nepal        NP              1.0
3      Bihar        1234   India        IN              0.0
4         UP        1337   India        IN              0.0

View encoded categories: run ordinal_encoder.categories_

[array(['India', 'Nepal'], dtype=object)]


### Encoding Multiple column

In [None]:
from pandas import json_normalize  # pandas >= 1.0; older versions use: from pandas.io.json import json_normalize
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

data = [{'Country': 'Nepal',
          'ShortName': 'NP',
          'OtherInfo': {
               'GovType': 'Democracy',
              'HeadState': 'Prime Minister'
          },
          'District/State/Province': [{'name': 'Kathmandu', 'population': 12345},
                      {'name': 'Morang', 'population': 40000},
                      {'name': 'Sunsari', 'population': 60000}]},
         {'Country': 'India',
          'ShortName': 'IN',
          'OtherInfo': {
               'GovType': 'Democracy',
              'HeadState': 'President'
          },
          'District/State/Province': [{'name': 'Bihar', 'population': 1234},
                       {'name': 'UP', 'population': 1337}]}]



df = json_normalize(data, 'District/State/Province', ['Country', 
                                                      ['OtherInfo', 'GovType'],
                                                     ['OtherInfo', 'HeadState']])
print("Before Ordinal Encoding")
print(df)

ordinal_cat = ['Country', 'OtherInfo.GovType', 'OtherInfo.HeadState']
print('\nLook to encode following categories', ordinal_cat)

ordinal_encoder = OrdinalEncoder()
encoded_data = ordinal_encoder.fit_transform(df[ordinal_cat])
print(encoded_data)
# To add the encoded values as separate columns
df_new_values = pd.DataFrame(encoded_data, columns=[f'{col}_encoded' for col in ordinal_cat])
df_encoded = pd.concat([df, df_new_values], axis=1)
print('\nAfter')
print(df_encoded)
print('\nView encoded categories')
print(ordinal_encoder.categories_)

Before Ordinal Encoding
        name  population Country OtherInfo.GovType OtherInfo.HeadState
0  Kathmandu       12345   Nepal         Democracy      Prime Minister
1     Morang       40000   Nepal         Democracy      Prime Minister
2    Sunsari       60000   Nepal         Democracy      Prime Minister
3      Bihar        1234   India         Democracy           President
4         UP        1337   India         Democracy           President

Look to encode following categories ['Country', 'OtherInfo.GovType', 'OtherInfo.HeadState']
[[1. 0. 1.]
 [1. 0. 1.]
 [1. 0. 1.]
 [0. 0. 0.]
 [0. 0. 0.]]

After
        name  population Country OtherInfo.GovType OtherInfo.HeadState  \
0  Kathmandu       12345   Nepal         Democracy      Prime Minister   
1     Morang       40000   Nepal         Democracy      Prime Minister   
2    Sunsari       60000   Nepal         Democracy      Prime Minister   
3      Bihar        1234   India         Democracy           President   
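4         UP        1337   India         Democracy           President   

   Country_encoded  OtherInfo.GovType_encoded  OtherInfo.HeadState_encoded
0              1.0                        0.0                          1.0
1              1.0                        0.0                          1.0
2              1.0                        0.0                          1.0
3              0.0                        0.0                          0.0
4              0.0                        0.0                          0.0

View encoded categories
[array(['India', 'Nepal'], dtype=object), array(['Democracy'], dtype=object), array(['President', 'Prime Minister'], dtype=object)]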

In [None]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
data = {'name':['John', 'Mary', 'Jane', 'Jacob'],
       'age': [5,2,5,1],
       'gender': ['m','f','f','m'],
       'pass/fail': ['fail','pass','fail','pass']}
df = pd.DataFrame(data)
df_cat = ['gender', 'pass/fail']
ordinal_encoder = OrdinalEncoder()
print(df)
encoded_array = ordinal_encoder.fit_transform(df[df_cat])
pd.concat([df, pd.DataFrame(encoded_array, columns=[f'Encoded_{col}' for col in df_cat])], axis=1)


    name  age gender pass/fail
0   John    5      m      fail
1   Mary    2      f      pass
2   Jane    5      f      fail
3  Jacob    1      m      pass


    name  age gender pass/fail  Encoded_gender  Encoded_pass/fail
0   John    5      m      fail             1.0                0.0
1   Mary    2      f      pass             0.0                1.0
2   Jane    5      f      fail             0.0                0.0
3  Jacob    1      m      pass             1.0                1.0


<h1>Null Values in DataFrame</h1>
<p><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isna.html">pandas.DataFrame.isna</a></p>

### Get the count of null values in each column of a DataFrame

In [None]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'age': [5, 6, np.nan],
                   'born': [pd.NaT, pd.Timestamp('1939-05-27'),
                            pd.Timestamp('1940-04-25')],
                   'name': ['Alfred', 'Batman', ''],
                   'toy': [None, 'Batmobile', 'Joker']})
print(df)
df.isna().sum()

   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker


age     1
born    1
name    0
toy     1
dtype: int64
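
Note that `name` reports 0 nulls even though one name is an empty string: `isna()` does not treat `''` as missing. A hedged sketch of one way to handle that, followed by two common follow-up steps (`dropna` and `fillna`); the data is a reduced copy of the example above and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [5, 6, np.nan],
                   'name': ['Alfred', 'Batman', ''],
                   'toy': [None, 'Batmobile', 'Joker']})

# Convert empty strings to NaN so that isna() counts them as missing.
df_clean = df.replace('', np.nan)
print(df_clean.isna().sum())

# Drop rows that contain any missing value.
rows_without_nulls = df_clean.dropna()

# Fill missing ages with the mean of the known ages.
age_filled = df_clean['age'].fillna(df_clean['age'].mean())
print(age_filled)
```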

<h1>One-Hot Encoding</h1>

In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
data = {'name':['John', 'Mary', 'Jane', 'Jacob'],
       'age': [5,2,5,1],
       'gender': ['m','f','f','m'],
       'pass/fail': ['fail','pass','fail','pass']}
df = pd.DataFrame(data)
df_cat = ['gender', 'pass/fail']
onehot_encoder = OneHotEncoder()
print(df)
encoded_array = onehot_encoder.fit_transform(df[df_cat])
# pd.concat([df, pd.DataFrame(encoded_array, columns = ['Encoded_'+str(df_cat[i]) for i in range(len(df_cat))])], axis=1)



    name  age gender pass/fail
0   John    5      m      fail
1   Mary    2      f      pass
2   Jane    5      f      fail
3  Jacob    1      m      pass



In [None]:
list(onehot_encoder.categories_[0])

['f', 'm']
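
Unlike `OrdinalEncoder`, `OneHotEncoder` produces one output column per category and returns a sparse matrix by default, so the column-naming approach used for ordinal encoding does not carry over directly. One possible way to attach the result to the DataFrame is sketched below; the `column_category` naming scheme is an assumption made for this example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'name': ['John', 'Mary', 'Jane', 'Jacob'],
                   'gender': ['m', 'f', 'f', 'm'],
                   'pass/fail': ['fail', 'pass', 'fail', 'pass']})
df_cat = ['gender', 'pass/fail']

onehot_encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
encoded_array = onehot_encoder.fit_transform(df[df_cat]).toarray()

# Build one column name per (column, category) pair from categories_.
encoded_columns = [f'{col}_{category}'
                   for col, categories in zip(df_cat, onehot_encoder.categories_)
                   for category in categories]

df_encoded = pd.concat([df, pd.DataFrame(encoded_array, columns=encoded_columns)], axis=1)
print(df_encoded)
```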

# Lambda expressions or Anonymous functions

In [14]:
# Create a function that returns the square of a given number
def square(a): # Define the function name and parameters. Here it takes one value, ideally a number
    return a*a # The intended operation: a single expression
x = 4 
print(square(x)) # Run the square function

16


In [17]:
# Do the same (calculate the square) using a lambda expression
f = lambda x: x*x # Create a function with a lambda expression. This only works when the body is a single expression
print(f(4))



16


In [19]:
# Lambda functions can be used inside another function.
# Create a function that builds multiplier functions (double, triple, quadruple, and so on)
def test(x): # Create a function named test that takes one input
    return lambda a : a * x # Returns a lambda with x fixed to the value passed in. So test(1) returns the equivalent of (lambda a: a * 1)
    # You can use this function to create other functions that double, triple, or quadruple a number.

testdouble = test(2) # x is fixed to 2, so testdouble(a) returns a * 2.
testtriple = test(3) # x is fixed to 3, so testtriple(a) returns a * 3.
getquadruple = test(4) # x is fixed to 4, so getquadruple(a) returns a * 4.

print(testdouble(11))
print(testtriple(11))
print(getquadruple(11))

22
33
44
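
Lambdas are most often passed as short, throwaway arguments to other functions. The sketch below shows this with the built-ins `sorted`, `map`, and `filter`; the list of names is sample data reused from earlier examples.

```python
names = ['Jingleheimer', 'John', 'Schmidt', 'Jacob']

# key= takes a function; the lambda tells sorted() to order by length.
sorted_by_length = sorted(names, key=lambda name: len(name))

# map() applies the lambda to every element.
lengths = list(map(lambda name: len(name), names))

# filter() keeps the elements for which the lambda returns True.
long_names = list(filter(lambda name: len(name) > 5, names))

print(sorted_by_length)
print(lengths)
print(long_names)
```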


In [33]:
# Use a lambda expression to separate names using a delimiter (Title, Firstname, LastName)
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.array([
    [1,'Mr. John Doe',5], 
    [2,'Dr. Jacob Jones',2],
    [3, 'Mrs. Jane Roe', 25],
    [4,'Master. Michael River',12]]),
                  columns=['id','name','age'])
print(df1)
print("Inspect the name column. Name consists of title, first and last name")

# Create a new column called title that holds only the title, using a lambda expression. Do the same to separate the first and last names
df1['title'] = df1["name"].apply(lambda x : x.split()[0])
df1['firstname'] = df1["name"].apply(lambda x : x.split()[1])
df1['lastname'] = df1["name"].apply(lambda x : x.split()[2])
df1




  id                   name age
0  1           Mr. John Doe   5
1  2        Dr. Jacob Jones   2
2  3          Mrs. Jane Roe  25
3  4  Master. Michael River  12
Inspect the name column. Name consists of title, first and last name


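  id                   name age    title firstname lastname
0  1           Mr. John Doe   5      Mr.      John      Doe
1  2        Dr. Jacob Jones   2      Dr.     Jacob    Jones
2  3          Mrs. Jane Roe  25     Mrs.      Jane      Roe
3  4  Master. Michael River  12  Master.   Michael    River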

In [None]:
a = "Mr Binay Raut"
f = lambda x : x.split(' ')
f(a)

['Mr', 'Binay', 'Raut']
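
For splitting a whole column, pandas also offers the vectorized `Series.str.split`, which avoids calling a lambda once per part. Calling `.split` directly on a Series (without `.str`) raises `AttributeError: 'Series' object has no attribute 'split'`. A minimal sketch, assuming every name has exactly three space-separated parts:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['Mr. John Doe', 'Dr. Jacob Jones',
                             'Mrs. Jane Roe', 'Master. Michael River']})

# str.split works element-wise; expand=True returns one column per part.
name_parts = df1['name'].str.split(' ', expand=True)
name_parts.columns = ['title', 'firstname', 'lastname']

df_split = pd.concat([df1, name_parts], axis=1)
print(df_split)
```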