# Using pyCSPro

pyCSPro is a simple Python library built around two main classes.
The first is the DictionaryParser class, which parses a CSPro dictionary and also provides ancillary functions: it can return the labels of record columns (useful for replacing the default column names, which are the name attributes of items and can therefore be cryptic) and the labels of values (useful for replacing coded values such as 1 and 2 with their respective labels, such as 'Male' and 'Female').
The second is the CaseParser class, which uses a parsed dictionary to parse cases.

## Install the package

In [None]:
!pip install --user pycspro

## Parse a dictionary

Here we parse the sample dictionary that is provided with CSPro. It can also be downloaded from this repo:

https://github.com/CSProDevelopment/examples

In [9]:
import json

from pycspro import DictionaryParser

# Read the raw dictionary (.dcf) text
with open('dictionary/Census Dictionary.dcf', 'r') as f:
    raw_dictionary = f.read()

dictionary_parser = DictionaryParser(raw_dictionary)
parsed_dictionary = dictionary_parser.parse()
print(json.dumps(parsed_dictionary, indent=4))

{
    "Dictionary": {
        "Name": "CEN2000",
        "Label": "Popstan Census",
        "Note": "",
        "Version": "CSPro 7.2",
        "RecordTypeStart": 1,
        "RecordTypeLen": 1,
        "Positions": "Relative",
        "ZeroFill": true,
        "DecimalChar": false,
        "Languages": [],
        "Relation": [],
        "Level": {
            "Name": "QUEST",
            "Label": "Questionnaire",
            "Note": "",
            "IdItems": [
                {
                    "Name": "PROVINCE",
                    "Label": "Province",
                    "Note": "",
                    "Len": 2,
                    "ItemType": "Item",
                    "DataType": "Numeric",
                    "Occurrences": 1,
                    "Decimal": 0,
                    "DecimalChar": false,
                    "ZeroFill": true,
                    "OccurrenceLabel": [],
                    "Start": 2,
                    "ValueSets": [
                        {
 

## Use parsed dictionary to parse cases

We pull the cases out of the CSPro example data file. Luckily, the given example has a single record type, so newlines (\n) appear only at the end of each case entry and we can use them to cut the content of the file into individual cases. If the file contained multiple record types, those records would also be separated by the newline character, and we would not be able to use it alone to cut the file into individual cases.
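The newline-based split described above can be sketched as follows. This is a minimal illustration and not part of pycspro: the helper name split_entries and the sample string are made up for the example, and it assumes every non-blank line is one entry.

```python
def split_entries(raw_text):
    """Split raw data-file text into non-empty lines, one per entry.

    Hypothetical helper for illustration; it is not part of pycspro.
    """
    # splitlines() handles both \n and \r\n line endings; blank lines
    # (such as the empty string a trailing newline leaves behind when
    # using split('\n')) are dropped.
    return [line for line in raw_text.splitlines() if line.strip()]


# Made-up sample content with mixed line endings and a trailing blank line
sample = "1010300011021024\r\n1010300011021025\n\n"
entries = split_entries(sample)
print(entries)
```

Dropping blank lines avoids handing an empty string to the parser when the file ends with a newline.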

The case parser accepts a list of cases. We can pass a single case in a list or as many as 100k.
The best approach would be to pass in about 50k cases at a time and convert the returned dictionary into pandas data frames, then pass in the next batch, convert that as well, and append the new data frames to the ones from the previous batches.
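The batching approach described above could look roughly like the sketch below. The helpers iter_batches and parse_in_batches are hypothetical names introduced here, and fake_parse is a stand-in so the example runs without pycspro. The sketch assumes the parser returns a dict that maps table names to data accepted by pd.DataFrame.from_dict, as in the rest of this notebook.

```python
import pandas as pd


def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def parse_in_batches(cases, parse_batch, batch_size=50_000):
    """Parse `cases` batch by batch and return one data frame per table.

    `parse_batch` is any callable that takes a list of cases and returns a
    dict mapping table names to table data (assumption: the same shape that
    is passed to pd.DataFrame.from_dict elsewhere in this notebook).
    """
    frames = {}  # table name -> list of per-batch data frames
    for batch in iter_batches(cases, batch_size):
        for table_name, table in parse_batch(batch).items():
            frames.setdefault(table_name, []).append(pd.DataFrame.from_dict(table))
    # Concatenate once per table rather than appending after every batch
    return {name: pd.concat(parts, ignore_index=True) for name, parts in frames.items()}


# Stand-in parser for demonstration only; it is NOT part of pycspro.
def fake_parse(batch):
    return {"QUEST": {"CASE_ID": list(batch)}}


tables = parse_in_batches(["a", "b", "c", "d", "e"], fake_parse, batch_size=2)
print(tables["QUEST"])
```

With the objects from this notebook, the call would presumably be parse_in_batches(cases, case_parser.parse). Collecting the per-batch frames in a list and concatenating once at the end avoids copying the growing data frame on every batch.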

In [14]:
import pandas as pd
from pycspro import CaseParser

# Read the whole data file and split it into one entry per line
with open('data/Popstan Census.dat', 'r') as f:
    cases = f.read().split('\n')

case_parser = CaseParser(parsed_dictionary)
# Parse only the first 10 entries to keep the output short
parsed_cases = case_parser.parse(cases[:10])

# Build one data frame per table returned by the parser
dfs = {}
for table_name, table in parsed_cases.items():
    dfs[table_name] = pd.DataFrame.from_dict(table)
    print(table_name)
    display(dfs[table_name])

QUEST


Unnamed: 0,CASE_ID,PROVINCE,DISTRICT,VILLAGE,EA,UR,BUILDING,HU,HH
0,10103000110210241,1,1,30,1,1,21,24,1
1,10103000110210241,1,1,30,1,1,21,24,1
2,10103000110210241,1,1,30,1,1,21,24,1
3,10103000110210241,1,1,30,1,1,21,24,1
4,10117100110870031,1,1,171,1,1,87,3,1
5,10117100110870031,1,1,171,1,1,87,3,1
6,10117100110870031,1,1,171,1,1,87,3,1
7,10117100110870031,1,1,171,1,1,87,3,1
8,10117100110870031,1,1,171,1,1,87,3,1
9,10117100110870031,1,1,171,1,1,87,3,1


PERSON


Unnamed: 0,CASE_ID,LINE,P02_REL,P03_SEX,P04_AGE,P05_MS,P06_MOTHER,P07_BIRTH,P08_RES95,P09_ATTEND,...,P14_WHY_NOT,P15_OCC,P15A_OCC,P16_IND,P16A_IND,P17_WK_STATUS,ECON_ACTIVE,P18_BORN,P19_LIVING,P20_BORN12
0,10103000110210241,1,1,2,19,5,1,1,1.0,2.0,...,,93.0,9.0,29.0,2.0,2.0,1.0,0.0,0.0,0.0
1,10103000110210241,2,5,2,7,5,1,1,1.0,1.0,...,,,,,,,,,,
2,10103000110210241,3,5,1,4,5,1,1,,,...,,,,,,,,,,
3,10117100110870031,1,1,1,40,1,1,1,3.0,2.0,...,,91.0,9.0,95.0,9.0,2.0,1.0,,,
4,10117100110870031,2,2,2,41,1,1,1,3.0,2.0,...,,91.0,9.0,95.0,9.0,2.0,1.0,12.0,12.0,1.0
5,10117100110870031,3,3,1,20,5,1,1,3.0,1.0,...,,,,,,9.0,2.0,,,
6,10117100110870031,4,3,2,15,5,1,1,3.0,1.0,...,,,,,,9.0,2.0,0.0,0.0,0.0
7,10117100110870031,5,3,1,13,5,1,1,3.0,1.0,...,,,,,,9.0,2.0,,,
8,10117100110870031,6,3,1,11,5,1,1,1.0,1.0,...,,,,,,9.0,2.0,,,


HOUSING


Unnamed: 0,CASE_ID,H01_TYPE,H02_WALL,H03_ROOF,H04_FLOOR,H05_ROOMS,H06_TENURE,H07_RENT,H08_TOILET,H09_BATH,H10_WATER,H11_LIGHT,H12_FUEL,H13_PERSONS
0,10103000110210241,1,1,2,3,2,1,0,5,2,2,1,5,3


## Changing column labels

In [13]:
person = dfs['PERSON']
#person.replace(dictionary_parser.get_value_labels('PERSON'))
print(dictionary_parser.get_value_labels('PERSON'))

None
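As a minimal sketch of the renaming step itself, the example below uses plain pandas with a hand-written mapping. The mapping and its labels ('Sex', 'Age') are hypothetical and are not taken from the dictionary parser, whose output format is not shown here.

```python
import pandas as pd

# Hypothetical name -> label mapping, written by hand for illustration only
column_labels = {"P03_SEX": "Sex", "P04_AGE": "Age"}

# Small made-up frame that reuses column names from the PERSON table
person_sample = pd.DataFrame({"LINE": [1, 2], "P03_SEX": [2, 2], "P04_AGE": [19, 7]})

# Columns missing from the mapping (LINE here) keep their original names
renamed = person_sample.rename(columns=column_labels)
print(renamed.columns.tolist())
```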


## Changing value labels
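Replacing coded values with their labels can be sketched with the nested-dictionary form of DataFrame.replace. The mapping below is written by hand and is only an assumption about what a column-to-labels mapping might look like; it does not show the actual output of get_value_labels.

```python
import pandas as pd

# Hypothetical column -> {code: label} mapping, written by hand for illustration
value_labels = {"P03_SEX": {1: "Male", 2: "Female"}}

# Small made-up frame that reuses column names from the PERSON table
person_codes = pd.DataFrame({"LINE": [1, 2, 3], "P03_SEX": [2, 2, 1]})

# A nested dict limits each replacement to the named column,
# so the values 1 and 2 in LINE are left untouched
person_labelled = person_codes.replace(value_labels)
print(person_labelled)
```

The nested form matters here because codes such as 1 and 2 occur in many columns with different meanings.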