# <div style="font-family: Trebuchet MS; background-color: #000099; color: #FFFFFF; padding: 12px; line-height: 1;">Notebook Structure </div>

0. **Analysis Summary**
1. **Std Lib Imports**
2. **Data Import**- Covers the basic data import and the concat operation
3. **Building understanding of data**- Most of the heavy lifting happens here
    * Understanding: Knowing what's in the data before cleaning
    * Data cleaning: Preparing cleaned data (df_) where each row is tagged with the session or joint session of the NA it belongs to
4. **Secondary Import**- Data imported from Wikipedia, where we've merged the given constituency names with their cities & areas for better EDA
5. **Exploration**- EDA of the final data (df_final) and comments on the relationships found


# <div style="font-family: Trebuchet MS; background-color: #000099; color: #FFFFFF; padding: 12px; line-height: 1;">Analysis Summary </div>

**Prompt**: The dataset was posted with the question, "Find The Performance of Your MNA"

Hence, we've plotted the presence of each MNA by local 'area' as a bar plot at the end of this notebook. For example, if you live in Karachi Central I, your MNA has attended the most sessions across all constituencies in Sindh (perhaps because Karachi Central I is one of the largest and densest areas in Sindh, maybe even in Pakistan).

**Results**: Here's how the best & worst performing constituencies rank, based on their presence in NA sessions:

Sindh-
    Best: Karachi & Tharparkar Areas
    Worst: Matiari & Ghotki

Punjab-
    Best: Lahore-III, Muzaffargarh IV, Gujranwala
    Worst: Faisalabad-I, Mianwali, Layyah, Gujrat

Balochistan-
    Best: Mastung, Jafarabad
    Worst: Kech, Khuzdar

KPK-
    Best: Chitral, Mansehra
    Worst: Mardan, South Waziristan, Islamabad III, Peshawar I
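
A ranking like the one above comes down to counting 'P' rows per constituency. Here is a minimal sketch of that idea on made-up rows; it is not the notebook's actual df_final, and only the column names 'constituency' and 'presense' are taken from the data.

```python
import pandas as pd

# Made-up rows for illustration; the real frame has one row per member per sitting
toy = pd.DataFrame({
    "constituency": ["NA-1", "NA-1", "NA-1", "NA-2", "NA-2", "NA-3"],
    "presense": ["P", "P", "p", "P", "P", "P"],
})

# Count how often each constituency's member was marked present
presence = (
    toy.loc[toy["presense"].str.upper() == "P"]
    .groupby("constituency")
    .size()
    .sort_values(ascending=False)
)

best = presence.head(1)   # most sittings attended
worst = presence.tail(1)  # fewest sittings attended
```

Since the number of sittings differs between sessions, raw counts are only comparable across members covering the same period.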

A simple Google search of a constituency name can tell you what area it covers.
A more detailed view is available here: https://en.wikipedia.org/wiki/List_of_constituencies_of_Pakistan

**Caveats**
I haven't looked at the data aggregated across individual sittings & dates. There could be some value in going down to that depth for more insights; future contributors can treat this as a pointer.
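
As a possible starting point for that deeper view, the sitting headers in the data (e.g. "4th Sitting held on Friday, the 13th March, 2020") could be parsed into a sitting number and a date. The sketch below assumes that header wording; `parse_sitting` and `SITTING_RE` are hypothetical names, not part of this notebook.

```python
import re
from datetime import date, datetime

# Assumed header shape: "<n>th Sitting held on <weekday>, the <d>th <Month>, <yyyy>"
SITTING_RE = re.compile(
    r"(\d+)(?:st|nd|rd|th) Sitting held on \w+, the (\d+)(?:st|nd|rd|th) (\w+), (\d{4})"
)

def parse_sitting(text):
    """Return (sitting_number, date) for a sitting header, or None if it does not match."""
    match = SITTING_RE.search(str(text))
    if match is None:
        return None
    sitting, day, month, year = match.groups()
    # %B expects a full English month name; a misspelled month raises ValueError
    parsed = datetime.strptime(f"{day} {month} {year}", "%d %B %Y").date()
    return int(sitting), parsed

example = parse_sitting("4th Sitting held on Friday, the 13th March, 2020")
```

With a date per sitting, attendance could then be grouped by month or weekday instead of by session only.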

# <div style="font-family: Trebuchet MS; background-color: #000099; color: #FFFFFF; padding: 12px; line-height: 1;">Std Lib Imports </div>

In [141]:
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 5000)
import re
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px

import os
for dirname, _, filenames in os.walk('data'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

data\Attendence of Members - Sessions 21 - 43.xls
data\Attendence of Members - Sessions 1 - 20.xls
data\NA-Records.xlsx
data\modified\na-records.csv
data\modified\National Assembly.csv
data\modified\test.csv


In [142]:
# Custom helper methods
def session(col):
    # Return the first element (e.g. the first number found by str.findall)
    return col[0]

def col_details(df, col_index):
    # Print NA statistics and value counts for the column at position col_index
    column = df[df.columns[col_index]]

    total = len(column)
    na_count = column.isna().sum()
    share_na_count = round(float(na_count / total) * 100, 2)

    print('Column:  ', df.columns[col_index])
    print('\nTotal Values', total, ' - NA count ', na_count, '- Share of NA ', share_na_count)
    print('\nValue_counts:')
    print(column.value_counts())

# <div style="font-family: Trebuchet MS; background-color: #000099; color: #FFFFFF; padding: 12px; line-height: 1;">Data Import </div>
1. Column views
2. Concat df
3. Sum of null values

In [143]:
root='data/Attendence of Members - Sessions 21 - 43.xls'

df_1= pd.read_excel(root, sheet_name= 'Session 1 to 20')
df_2= pd.read_excel(root,sheet_name= 'Attendence of Members - Session')

df_1.columns= ['meta', 'constituency','name','presense']
df_2.columns= ['meta', 'constituency','name','presense']


In [144]:
df_1.columns

Index(['meta', 'constituency', 'name', 'presense'], dtype='object')

In [145]:
df_2.columns

Index(['meta', 'constituency', 'name', 'presense'], dtype='object')

In [146]:
df_1.shape

(37247, 4)

In [147]:
df_2.shape

(38992, 4)

In [148]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df = pd.concat([df_1, df_2])
df.shape

(76239, 4)

In [149]:
df.reset_index(inplace=True,drop=True)
df.head()

Unnamed: 0,meta,constituency,name,presense
0,,(Notice Office),,
1,"1st Joint Session held on Monday, the 17th Sep...",,,
2,The following Mambers National Assembly of Pak...,,,
3,Sl. No.,Constituency,Name of Member,Status
4,1,NA-1,Moulana Abdul Akbar Chitrali,P


In [150]:
cols=df.columns
cols

Index(['meta', 'constituency', 'name', 'presense'], dtype='object')

In [151]:
def print_header(string, delimeter = '*', times_count = 10):
    print(delimeter * times_count, string, delimeter * times_count)


print_header('info', times_count=15)
print(df.info())

print_header('is NaN', times_count=15)
print(df.isna().sum())

*************** info ***************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76239 entries, 0 to 76238
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   meta          75618 non-null  object
 1   constituency  74888 non-null  object
 2   name          75844 non-null  object
 3   presense      74560 non-null  object
dtypes: object(4)
memory usage: 2.3+ MB
None
*************** is NaN ***************
meta             621
constituency    1351
name             395
presense        1679
dtype: int64


In [152]:
# sample data
df.iloc[37040: 37350]


Unnamed: 0,meta,constituency,name,presense
37040,255,Reserved Seat,Mr. Ramesh Kumar Vankwani,P
37041,256,Reserved Seat,Mr. Jamshed Thomas,P
37042,257,Reserved Seat,Dr. Darshan,P
37043,258,Reserved Seat,Mr. Kesoo Mal Kheeal Das,P
37044,259,Reserved Seat,Mr. Naveed Aamir Jeeva,P
37045,,,NATIONAL ASSEMBLY SECRETARIAT,
37046,,,(Notice Office),
37047,20th Session,,"4th Sitting held on Friday, the 13th March, 2020",
37048,The following Members National Assembly of Pak...,,,
37049,Sl. No.,Constituency,Name of Member,Status


In [153]:
# is there any absent count present in the column?
df.loc[(df['presense'] == 'A') | (df['presense'] == 'a')]


Unnamed: 0,meta,constituency,name,presense
57019,32,NA-43,Mr. Noor-ul-Haq Qadri,A
72108,136,NA-192,Sardar Muhammad Khan Laghari,A


It seems there are only 2 absences recorded in the column, and it is very unlikely that there were only 2 absences across the whole assembly attendance record.

It is more likely that only the members present were recorded for each sitting.

TODO: Check whether the number of appearances of each name equals the total number of sittings in each session.
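
One way the TODO above could be approached is to compare each member's row count with the number of sittings in the session. This is a sketch on made-up rows; the 'sitting' column and the member names are hypothetical and would first have to be derived from the header rows.

```python
import pandas as pd

# Made-up cleaned rows: one row per member per sitting in which they were listed
toy = pd.DataFrame({
    "session": ["1", "1", "1", "1", "1"],
    "sitting": [1, 1, 2, 2, 3],          # hypothetical column
    "name": ["Member A", "Member B", "Member A", "Member B", "Member A"],
})

# Number of distinct sittings held in each session
sittings_per_session = toy.groupby("session")["sitting"].nunique()

# Number of times each member is listed in each session
appearances = (
    toy.groupby(["session", "name"]).size().rename("appearances").reset_index()
)
appearances["sittings"] = appearances["session"].map(sittings_per_session)
appearances["missed"] = appearances["sittings"] - appearances["appearances"]
```

If many members show a non-zero `missed` count, that would support the idea that only present members were listed for each sitting.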

# **Data Cleaning**

Before dropping any nulls, let's first tag each record with the session or joint session it belongs to.
Approach:
1. First, create new cols named 'session' & 'joint_session_number' from each matching header found in the cols[0] text
2. Next, broadcast the session numbers over the range of rows they apply to
3. Lastly, drop the completely unnecessary rows

**End result:** Our df has the session & joint_session number populated against each record
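
The three steps above can also be sketched in a vectorized way with `str.extract` and a grouped forward fill. This is an alternative illustration on a toy 'meta' column, not the loop-based code this notebook uses; it assumes every value is a string, whereas the real column mixes strings and numbers.

```python
import pandas as pd

# Toy stand-in for the 'meta' column (all strings, which is an assumption)
toy = pd.DataFrame({"meta": [
    "1st Joint Session held on Monday",
    "Sl. No.", "1", "2",
    "1st Session", "Sl. No.", "1",
    "2nd Session", "1",
]})

# Step 1: flag the header rows and pull out their leading number
is_joint = toy["meta"].str.contains(r"^\d+[a-z]+ Joint\s+Session", na=False)
is_session = toy["meta"].str.contains(r"^\d+[a-z]+ Session", na=False)
number = toy["meta"].str.extract(r"^(\d+)", expand=False)

# Step 2: every header starts a new block; fill the number down within its block only
block = (is_joint | is_session).cumsum()
toy["session"] = number.where(is_session).groupby(block).ffill()
toy["joint_session"] = number.where(is_joint).groupby(block).ffill()

# Step 3: keep only the member rows, whose 'meta' is a plain serial number
members = toy[toy["meta"].str.fullmatch(r"\d+", na=False)]
```

Grouping by `block` keeps a joint session number from leaking into the regular session that follows it, which is the same reset the loop below performs.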


In [154]:
cols


Index(['meta', 'constituency', 'name', 'presense'], dtype='object')

In [155]:
df['meta']

0                                                      NaN
1        1st Joint Session held on Monday, the 17th Sep...
2        The following Mambers National Assembly of Pak...
3                                                  Sl. No.
4                                                        1
                               ...                        
76234                                                  155
76235                                                  156
76236                                                  157
76237                                                  158
76238                                                  159
Name: meta, Length: 76239, dtype: object

In [156]:
df[df['meta'].str.contains(r'\d[a-z]+ Session|session', na=False, regex=True)]
# test.sort_values(by=['meta', 'name'])
# [cols[0]].str.findall('(\d+)').apply(session)


Unnamed: 0,meta,constituency,name,presense
633,1st Session,,"2nd Sitting held on Wednesday, the 15th August...",
968,1st Session,,"3rd Sitting held on Friday, the 17th August, 2018",
1845,2nd Session,,"7th Sitting held on Monday, the 1st October, 2018",
2087,2nd Session,,"8th Sitting held on Tuesday, the 2nd October, ...",
2374,2nd Session,,"9th Sitting held on Wednesday, the 3rd October...",
2674,2nd Session,,"1st Sitting held on Tuesday, the 18th Septembe...",
2960,2nd Session,,"2nd Sitting held on Monday, the 24th September...",
3214,2nd Session,,"3rd Sitting held on Tuesday, the 25th Septembe...",
3476,2nd Session,,"4th Sitting held on Wednesday, the 26th Septem...",
3726,2nd Session,,"5th Sitting held on Thursday, the 27th Septemb...",


In [157]:
df[df['meta'].str.contains(r'\d[a-z]+ Session|session', na=False, regex=True)]['meta'].str.findall(r'(\d+)').apply(session)


633       1
968       1
1845      2
2087      2
2374      2
2674      2
2960      2
3214      2
3476      2
3726      2
3986      2
4711      3
5246      4
5503      4
5675      4
5926      4
6171      4
6441      4
6720      4
6972      4
7262      4
7535      4
8089      5
8628      6
8887      6
9151      6
9417      6
9655      6
9835      6
10079     6
10337     6
10825     7
11099     7
11369     7
11642     7
11861     7
12051     7
12305     7
12576     7
12874     7
13136     7
13687     8
13926     8
14202     8
14483     8
14760     8
14982     8
15217     8
15427     8
15981     9
16226     9
16440     9
16640     9
16906     9
17200     9
17444     9
17746     9
18006     9
18270     9
18529     9
18751     9
18948     9
19202     9
19778    10
19988    10
20192    10
20412    10
20952    11
21204    11
21506    11
21772    11
22048    11
22343    11
22620    11
22895    11
23167    11
23368    11
23533    11
23820    11
24126    11
24457    11
24792    11
25130    11
2545

In [158]:
df.head()

Unnamed: 0,meta,constituency,name,presense
0,,(Notice Office),,
1,"1st Joint Session held on Monday, the 17th Sep...",,,
2,The following Mambers National Assembly of Pak...,,,
3,Sl. No.,Constituency,Name of Member,Status
4,1,NA-1,Moulana Abdul Akbar Chitrali,P


In [159]:
# df.loc[df['session'].astype(float) >= 1]


In [160]:
# df[df[cols[0]].str.contains('(\d)[a-z]+ joint|Joint', na=False, regex=True)][cols[0]].str.findall('(\d+)').apply(session)
# df['meta'].str.contains('(\d)[a-z]+ joint|Joint', na=False, regex=True)
# Raw strings without a capture group avoid the escape and match-group warnings
df['session'] = df[df['meta'].str.contains(r'\d[a-z]+ Session|session', na=False, regex=True)]['meta'].str.findall(r'(\d+)').apply(session)

df['joint_session'] = df[df['meta'].str.contains(r'\d[a-zA-Z]+ joint|Joint', regex=True, na=False)]['meta'].str.findall(r'(\d+)').apply(session).astype(float)



In [161]:
df[df['meta'].str.contains(r'\d+[a-zA-Z]+ joint', regex=True, na=False, case=False)]['meta'].str.findall(r'(\d+)').apply(session).astype(float)


1         1.0
1301      2.0
1540      2.0
4215      3.0
4463      3.0
4970      4.0
7807      5.0
8324      6.0
10569     7.0
13365     8.0
15695     9.0
19441    10.0
20617    11.0
20788    11.0
Name: meta, dtype: float64

In [162]:
df.head()


Unnamed: 0,meta,constituency,name,presense,session,joint_session
0,,(Notice Office),,,,
1,"1st Joint Session held on Monday, the 17th Sep...",,,,,1.0
2,The following Mambers National Assembly of Pak...,,,,,
3,Sl. No.,Constituency,Name of Member,Status,,
4,1,NA-1,Moulana Abdul Akbar Chitrali,P,,


In [163]:
df.loc[df.joint_session.astype(float) >= 1, :]


Unnamed: 0,meta,constituency,name,presense,session,joint_session
1,"1st Joint Session held on Monday, the 17th Sep...",,,,,1.0
1301,2nd Joint\nSession,,"2nd Sitting held on Friday, the 1st March, 2019",,,2.0
1540,2nd Joint\nSession,,"1st Sitting held on Thursday, the 28th Februar...",,,2.0
4215,3rd Joint Session,,"1st Sitting held on Tuesday, the 6th August, 2019",,,3.0
4463,3rd Joint Session,,"2nd Sitting held on Wednesday, the 7th August,...",,,3.0
4970,4th Joint Session,,"1st Sitting held on Thursday, the 12th Septemb...",,,4.0
7807,5th Joint Session,,"1st Sitting held on Friday, the 14th February,...",,,5.0
8324,6th Joint Session,,"1st Sitting held on Thursday, the 6th August, ...",,,6.0
10569,7th Joint Session,,"1st Sitting held on Thursday, the 20th August,...",,,7.0
13365,8th Joint Session,,"1st Sitting held on Wednesday, the 16th Septem...",,,8.0


In [164]:
# broadcasting
df_copy = df.copy()
df_copy2 = df.copy()


In [165]:
df_copy.index


RangeIndex(start=0, stop=76239, step=1)

In [166]:
df_copy.head()


Unnamed: 0,meta,constituency,name,presense,session,joint_session
0,,(Notice Office),,,,
1,"1st Joint Session held on Monday, the 17th Sep...",,,,,1.0
2,The following Mambers National Assembly of Pak...,,,,,
3,Sl. No.,Constituency,Name of Member,Status,,
4,1,NA-1,Moulana Abdul Akbar Chitrali,P,,


In [167]:
# df_copy.loc[]
# pd.isna(df_copy.joint_session)
print(df_copy.joint_session[0])
type(df_copy.joint_session[0])


nan


numpy.float64

In [168]:
df_copy.iloc[1295:1350]

Unnamed: 0,meta,constituency,name,presense,session,joint_session
1295,325,Reserved Seat,Mr. Kesoo Mal Kheeal Das,P,,
1296,326,Reserved Seat,Mr. Ramesh Lal,P,,
1297,327,Reserved Seat,Mr. Naveed Aamir,P,,
1298,328,Reserved Seat,Mr. James Iqbal,P,,
1299,NATIONAL ASSEMBLY SECRETARIAT,,,,,
1300,(Notice Office),,,,,
1301,2nd Joint\nSession,,"2nd Sitting held on Friday, the 1st March, 2019",,,2.0
1302,The following Members National Assembly of Pak...,,,,,
1303,S. No,Constituency,Name,Status,,
1304,1,NA-1,Moulana Abdul Akbar Chitrali,P,,


In [172]:
df_copy.to_csv('data/modified/test.csv', index=False)

In [None]:
from pandas import isna, notna

# Broadcasting: copy each header's number down to the rows beneath it
df_copy = df.copy()     # will hold the broadcast joint_session values
df_copy_ = df.copy()    # will hold the broadcast session values

# The filled values are strings, so make sure both columns can hold them
df_copy['joint_session'] = df_copy['joint_session'].astype(object)
df_copy_['session'] = df_copy_['session'].astype(object)

# Fill joint_session (column 5); a regular session header (column 4) resets it
num = 'NaN'
for i in df_copy.index:
    if isna(df_copy.iloc[i, 5]) and isna(df_copy.iloc[i, 4]):
        df_copy.iloc[i, 5] = num

    if notna(df_copy.iloc[i, 5]):
        num = str(df_copy.iloc[i, 5])
    else:
        num = 'NaN'

# Fill session (column 4); a joint session header (column 5) resets it
num = 'NaN'
for i in df_copy_.index:
    if isna(df_copy_.iloc[i, 4]) and isna(df_copy_.iloc[i, 5]):
        df_copy_.iloc[i, 4] = num

    if notna(df_copy_.iloc[i, 4]):
        num = str(df_copy_.iloc[i, 4])
    else:
        num = 'NaN'

In [None]:
df_copy['joint_session']

In [None]:
# df_copy stores the broadcast values in its 'joint_session' column
df['joint_session_number'] = df_copy['joint_session']
df['session'] = df_copy_['session']
