# NSS Data Cleaning

This notebook is to take the data provided by the Indian government and filter it, primarily to provide a map from households to districts within the state of Bihar.

In [7]:
!pip install pandas
import pandas as pd



In [8]:
df = pd.read_stata('Household characteristics - Block 3 - Level2 - type2 - 68.dta')

In [10]:
df['HHID']

0         715581201
1         715581202
2         715581203
3         715581204
4         715581301
            ...    
101646    454191104
101647    454191201
101648    454191202
101649    454191203
101650    454191204
Name: HHID, Length: 101651, dtype: object

### `dem_filter`

The `dem_filter` function takes the nss data, a list of attributes, and a district mapping, and returns a filtered, subsetted, and coded df for the given state. Codes used for household types, religion, and social class are from the NSS documentation.

In [11]:
def dem_filter(state_id, columns, district_map, df):
    
    # filter for state and selected columns
    # new = df.set_index('HHID')
    new = df[df['State_code'] == state_id]
    new = df[columns]
    
    # map districts
    new['District_map'] = new['District'].map(district_map)
    new = new.drop(['District'], axis=1)
    new.rename({'District_map': 'District'}, axis=1, inplace=True)
    
    # map household types 
    new['rural_urban'] = new['Sector'].map({'1':'rural','2':'urban'})
    new= new.drop(['Sector'], axis=1)
    
    # map religion
    new['Religion_map'] = new['Religion'].map({'1':'Hinduism','2':'Islam',
                                                                   '3':'Christianity','4':'Sikhism',
                                                                   '5':'Jainism','6':'Buddhism',
                                                                   '7':'Zoroastrianism','9':'Others'})
    new = new.drop(['Religion'], axis=1)
    new.rename({'Religion_map': 'Religion'}, axis=1, inplace=True)
    
    return new

In [12]:
state_id = 10

columns = ['HHID','District','hh_size','Religion', 'Sector']

district_map = {'01':'Pashchim Champaran','02':'Purba Champaran','03':'Sheohar','04':'Sitamarhi','05':'Madhubani','06':'Supaul',
             '07':'Araria','08':'Kishanganj','09':'Purnia','10':'Katihar','11':'Madhepura','12':'Saharsa',
             '13':'Darbhanga','14':'Muzaffarpur','15':'Gopalganj','16':'Siwan','17':'Saran','18':'Vaishali',
             '19':'Samastipur','20':'Begusarai','21':'Khagaria','22':'Bhagalpur','23':'Banka','24':'Munger',
             '25':'Lakhisarai','26':'Sheikhpura','27':'Nalanda','28':'Patna','29':'Bhojpur','30':'Buxar','31':'Kaimur (Bhabua)',
             '32':'Rohtas','33':'Aurangabad','34':'Gaya','35':'Nawada','36':'Jamui','37':'Jehanabad','38':'Arwal'}

df = dem_filter(state_id, columns, district_map, df)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new['District_map'] = new['District'].map(district_map)


In [13]:
df

Unnamed: 0,HHID,hh_size,District,rural_urban,Religion
0,715581201,5,Pashchim Champaran,rural,Christianity
1,715581202,2,Pashchim Champaran,rural,Christianity
2,715581203,2,Pashchim Champaran,rural,Christianity
3,715581204,1,Pashchim Champaran,rural,Christianity
4,715581301,6,Pashchim Champaran,rural,Christianity
...,...,...,...,...,...
101646,454191104,4,Sitamarhi,urban,Hinduism
101647,454191201,6,Sitamarhi,urban,Islam
101648,454191202,3,Sitamarhi,urban,Hinduism
101649,454191203,7,Sitamarhi,urban,Islam


In [14]:
df.to_csv('nss_bihar.csv', index=True)