This notebook preprocesses US Census data from 1990. It reads the column names from the "USCensus1990raw.attributes.txt" file and attaches them to the data in the "USCensus1990raw.data.txt" file. Unnecessary columns are then dropped, and the reduced dataset is saved as a CSV file. The following columns are kept:

    - ANCSTRY1
    - ANCSTRY2
    - RLABOR
    - WWII
    - WEEK89

In [30]:
import numpy as np
import pandas as pd

In [31]:
data_path_1990 = "data_1990/"
column_names_file = data_path_1990+"USCensus1990raw.attributes.txt"

column_list = [] # Will contain all column names

with open(column_names_file) as file:
    for line in file:
        temp_elements = ''            # Will hold string for current column name. Resets every new line
        
        # Skip lines that start with a space, 'VAR' or '_', and blank lines: none of them begin with a column name
        if line.startswith((' ', 'VAR', '_', '\n')):
            continue
        else:
            print("\n")
        
        # Let's get each letter(element) in the line
        for element in line:  
            if(element == ' '): # If we get to a 'space' then we've reached the end of the column name, so break
                break
            else:
                print(element,end="")
                temp_elements += element # Save the current column name
            
        column_list.append(temp_elements)



AAGE

AANCSTR1

AANCSTR2

AAUGMENT

ABIRTHPL

ACITIZEN

ACLASS

ADEPART

ADISABL1

ADISABL2

AENGLISH

AFERTIL

AGE

AHISPAN

AHOUR89

AHOURS

AIMMIGR

AINCOME1

AINCOME2

AINCOME3

AINCOME4

AINCOME5

AINCOME6

AINCOME7

AINCOME8

AINDUSTR

ALABOR

ALANG1

ALANG2

ALSTWRK

AMARITAL

AMEANS

AMIGSTAT

AMOBLLIM

AMOBLTY

ANCSTRY1

ANCSTRY2

AOCCUP

APERCARE

APOWST

ARACE

ARELAT1

ARIDERS

ASCHOOL

ASERVPER

ASEX

ATRAVTME

AVAIL

AVETS1

AWKS89

AWORK89

AYEARSCH

AYRSSERV

CITIZEN

CLASS

DEPART

DISABL1

DISABL2

ENGLISH

FEB55

FERTIL

HISPANIC

HOUR89

HOURS

IMMIGR

INCOME1

INCOME2

INCOME3

INCOME4

INCOME5

INCOME6

INCOME7

INCOME8

INDUSTRY

KOREAN

LANG1

LANG2

LOOKING

MARITAL

MAY75880

MEANS

MIGPUMA

MIGSTATE

MILITARY

MOBILITY

MOBILLIM

OCCUP

OTHRSERV

PERSCARE

POB

POVERTY

POWPUMA

POWSTATE

PWGT1

RACE

RAGECHLD

REARNING

RECTYPE

RELAT1

RELAT2

REMPLPAR

RIDERS

RLABOR

ROWNCHLD

RPINCOME

RPOB

RRELCHLD

RSPOUSE

RVETSERV

SCHOOL

SEPT80

SERIALNO

SEX

S

In [32]:
print(column_list)

['AAGE', 'AANCSTR1', 'AANCSTR2', 'AAUGMENT', 'ABIRTHPL', 'ACITIZEN', 'ACLASS', 'ADEPART', 'ADISABL1', 'ADISABL2', 'AENGLISH', 'AFERTIL', 'AGE', 'AHISPAN', 'AHOUR89', 'AHOURS', 'AIMMIGR', 'AINCOME1', 'AINCOME2', 'AINCOME3', 'AINCOME4', 'AINCOME5', 'AINCOME6', 'AINCOME7', 'AINCOME8', 'AINDUSTR', 'ALABOR', 'ALANG1', 'ALANG2', 'ALSTWRK', 'AMARITAL', 'AMEANS', 'AMIGSTAT', 'AMOBLLIM', 'AMOBLTY', 'ANCSTRY1', 'ANCSTRY2', 'AOCCUP', 'APERCARE', 'APOWST', 'ARACE', 'ARELAT1', 'ARIDERS', 'ASCHOOL', 'ASERVPER', 'ASEX', 'ATRAVTME', 'AVAIL', 'AVETS1', 'AWKS89', 'AWORK89', 'AYEARSCH', 'AYRSSERV', 'CITIZEN', 'CLASS', 'DEPART', 'DISABL1', 'DISABL2', 'ENGLISH', 'FEB55', 'FERTIL', 'HISPANIC', 'HOUR89', 'HOURS', 'IMMIGR', 'INCOME1', 'INCOME2', 'INCOME3', 'INCOME4', 'INCOME5', 'INCOME6', 'INCOME7', 'INCOME8', 'INDUSTRY', 'KOREAN', 'LANG1', 'LANG2', 'LOOKING', 'MARITAL', 'MAY75880', 'MEANS', 'MIGPUMA', 'MIGSTATE', 'MILITARY', 'MOBILITY', 'MOBILLIM', 'OCCUP', 'OTHRSERV', 'PERSCARE', 'POB', 'POVERTY', 'POWPUMA'
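
The character-by-character loop above can probably be condensed with `str.split`. The sketch below is a minimal version, not the notebook's own code: `read_column_names` is a hypothetical helper, and the sample text is made up to imitate the layout the loop expects (header lines starting with "VAR" or "_", indented continuation lines, blank lines). It does not reproduce the real attributes file.

```python
import io

# Made-up excerpt imitating the assumed layout of the attributes file.
sample_attributes = io.StringIO(
    "VAR:   TYP:   DES:   LEN:   LABEL:\n"
    "__________________________________\n"
    "AAGE       C   X   1   first made-up description\n"
    "                       0   made-up category\n"
    "                       1   made-up category\n"
    "\n"
    "AANCSTR1   C   X   1   second made-up description\n"
)


def read_column_names(lines):
    """Return the first space-delimited token of every line that names a column."""
    names = []
    for line in lines:
        # Same skip rule as the loop above: indented, header, rule and blank lines.
        if line.startswith((" ", "VAR", "_", "\n")):
            continue
        # Everything before the first space is taken as the column name.
        names.append(line.split(" ", 1)[0])
    return names


sample_names = read_column_names(sample_attributes)
print(sample_names)
```

Like the original loop, this keeps a trailing newline in the name if a line contains no space at all; calling `line.split()[0]` instead would split on any whitespace and avoid that.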

In [33]:
raw_data = pd.read_csv(data_path_1990+"Raw_Reduced_Data1990.txt",delimiter="\t",names=column_list)
raw_data.head()

Unnamed: 0,AAGE,AANCSTR1,AANCSTR2,AAUGMENT,ABIRTHPL,ACITIZEN,ACLASS,ADEPART,ADISABL1,ADISABL2,...,TMPABSNT,TRAVTIME,VIETNAM,WEEK89,WORK89,WORKLWK,WWII,YEARSCH,YEARWRK,YRSSERV
0,0,0,0,0,0,0,0,0,0,1,...,3,0,0,28,1,2,0,8,2,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,5,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,60,1,52,1,1,0,15,1,4
3,0,0,0,0,0,0,0,0,0,0,...,3,0,0,52,1,2,0,11,1,0
4,0,0,0,0,0,0,0,0,0,0,...,0,10,0,26,1,1,0,11,1,0


Let's now remove all of the columns we don't want. 

Columns to keep:
    - ANCSTRY1
    - ANCSTRY2
    - RLABOR
    - WWII
    - WEEK89

In [34]:
# Look up the positional index of each column we want to keep
columns_to_keep = ["ANCSTRY1", "ANCSTRY2", "RLABOR", "WWII", "WEEK89"]
col_list = [raw_data.columns.get_loc(name) for name in columns_to_keep]
col_list

[35, 36, 102, 121, 118]

In [35]:
# Only keep columns of data we want to analyze
data = raw_data.iloc[:,col_list]

In [36]:
data.head()

Unnamed: 0,ANCSTRY1,ANCSTRY2,RLABOR,WWII,WEEK89
0,999,999,3,0,28
1,999,999,0,0,0
2,50,999,1,0,52
3,32,22,6,0,52
4,50,32,1,0,26
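
The positional selection above could also be written by label, which skips the `get_loc` lookups. The sketch below uses a tiny made-up frame (`toy`, with values borrowed from the first rows shown above) rather than the real data, and compares the two approaches. The `.copy()` call is an optional choice so that the result is an independent frame.

```python
import pandas as pd

# Tiny stand-in for raw_data; the column order differs from the keep list on purpose.
toy = pd.DataFrame({
    "AAGE": [0, 0],
    "ANCSTRY1": [999, 50],
    "ANCSTRY2": [999, 999],
    "RLABOR": [3, 1],
    "WEEK89": [28, 52],
    "WWII": [0, 0],
})

keep = ["ANCSTRY1", "ANCSTRY2", "RLABOR", "WWII", "WEEK89"]

# Positional selection, as in the cells above.
by_position = toy.iloc[:, [toy.columns.get_loc(name) for name in keep]]

# Label-based selection; columns come back in the order given in `keep`.
by_label = toy.loc[:, keep].copy()

print(by_label.equals(by_position))
```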


In [37]:
print(data.shape[0],",",data.shape[1])

12291 , 5


In [38]:
data.to_csv("PreprocessedData/preprocessed_data.csv")
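
By default, `to_csv` also writes the row index as an unnamed first column, and the target directory has to exist already. The sketch below shows one way to handle both, assuming the index is not needed downstream. It writes a made-up two-row frame to a temporary directory instead of the real "PreprocessedData/" folder.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the reduced dataset.
toy = pd.DataFrame({"ANCSTRY1": [999, 50], "WEEK89": [28, 52]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "PreprocessedData")
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing

    out_path = os.path.join(out_dir, "preprocessed_data.csv")
    toy.to_csv(out_path, index=False)    # index=False leaves the row index out

    # Read the file back to confirm that only the data columns were written.
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
```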