Skip to content

Latest commit

 

History

History
747 lines (623 loc) · 29.4 KB

spcrosswlk.md

File metadata and controls

747 lines (623 loc) · 29.4 KB

1. Stata to Pandas Cross-Walk

2. Introduction

This repo uses Stata example data sets to crosswalk a variety of Stata-Pandas equivalent code.

For pandas, this crosswalk assumes the following import statements:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
# Additional optional setting
pd.set_option('display.max_rows', 8)

3. Pendng & Unresolved Issues

4. Data Management

4.1. Starting Out

Description Stata Code Pandas Code
Load example data use http://www.stata-press.com/data/r15/auto2.dta exfile = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')
Display data list exfile
Display subset of observations list in 1/10
or
list in 10/20
exfile.head(n=10)
or
exfile[9:20]
Display subset of variables list make price mpg trunk exfile[['make','price','mpg','trunk']]
Display subset of obs & vars list make price mpg trunk in 1/10 exfile[['make','price','mpg','trunk']].head(n=10)
Display specific observation list make price mpg trunk if _n == 10 The most Stata-like will be
exfile.iloc[9]
also available is df.loc when label-location based index is available.
Display observation(s) by logic list make price mpg trunk if mpg > 30 exfile[exfile['mpg'] > 30]
List variable names and/or get variable information desc
or
describe
exfile.dtypes and
exfile.describe() or
for var in exfile.columns:
     print(var)
Generate new text variable gen newtxt = "Some text here" exfile['newtxt'] = 'Some text here'
Change text variable value replace newtxt = "Newer text that is really looooooong." exfile['newtxt'] = 'Newer text that is really looooooong.'
Replace or change text variable based on existing variable replace newtxt = substr(newtxt, 1, 10)
or
gen newest = substr(newtxt, 1, 10)
exfile['newtxt'] = exfile['newtxt'].str.slice(0,10)
or
exfile['newest'] = exfile['newtxt'].str.slice(0,10)
or
exfile['newest'] = exfile['newtxt'].apply(lambda x: x[0:10])
Split text variables split make, parse(" ") exfile['ms'] = exfile['make'].str.split(' ') *
Replace text in variables replace make = subinstr(make," ","-",.) exfile['make'] = exfile['make'].str.replace(' ', '-')
Replace or change value based on existing values replace headroom == 9.0 if headroom == 3.0 exfile['headroom'] = exfile['headroom'].replace(3.0,9.0)
or
exfile['foreign'] = exfile['foreign'].replace('Domestic','USA')
Generate new int variable gen newnum = 10 exfile['newnum'] = 10
Change int variable value replace newnum = 222 exfile['newnum'] = 222
Generate dummy based on text variable gen fgn = (foreign == "Foreign") exfile['fgn'] = np.where(exfile['foreign']=='Foreign', 1, 0)
or
exfile['fgn'] = exfile['foreign'].apply(lambda x: 1 if x == 'Foreign' else 0)
Generate new variable relative to other variable gen rtd_disp = displacement / 10
or
gen mpg2 = mpg * mpg
or
gen lprice = ln(price)
exfile['rtd_disp'] = exfile['displacement'] / 10
or
exfile['mpg2'] = exfile['mpg'] * exfile['mpg']
or
exfile['lprice'] = np.log10(exfile['price'])
or
exfile['mpg2'] = [i * i for i in exfile['mpg']]
or
exfile['mpg2'] = exfile['mpg'].apply(lambda x: x * x)
or
exfile['mpg_weight'] = exfile.apply(lambda row: row['mpg']+row['weight'],axis=1)
Generate new variable equal to _n or index gen sorter = _n exfile['sorter'] = np.arange(0,len(exfile.index))
Rename a variable rename mpg milespg Two steps: exfile['milespg'] = exfile['mpg']
del exfile['mpg']
One steps:
exfile.rename(columns = {'mpg':'milespg'}, inplace=True)
Delete variable(s) drop newtxt newnum exfile = exfile.drop(columns=['newtxt','newnum'])
or
exfile = exfile.drop(['newtxt','newnum'], axis=1)
or
exfile.drop(columns="newtxt", inplace = True)
or
del exfile['newtxt']
del exfile['newnum']
Keep variable(s) keep make price mpg exfile = exfile[['make','price','mpg']]
Delete specific observation drop if _n = 10 exfile = exfile.drop(9)
Delete observation(s) by logic drop if mpg > 30
or
keep if mpg < 31
exfile = exfile[exfile['mpg'] < 31]
or
exfile = exfile[exfile['mpg'] > 30]
Sorting by a variable ascending sort price exfile = exfile.sort_values(by=['price'])
Sorting by a variable descending gsort -price exfile = exfile.sort_values(by=['price'], ascending=False)
Display summary statistics (specific variables) sum price mpg weight exfile[['price','mpg','weight']].describe()
or
exfile[['price','mpg','weight']].describe().T
Enstring numbers tostring price, gen(pricestr) exfile['price_str'] = exfile['price'].astype(str)
Destring strings destring pricestr, gen(pricenum) exfile['price_num'] = exfile['price_str'].astype(int)
  • Splits behave differently in Stata & Pandas. More development needed here.

A function that Stata does quickly and simplistically count, is not well implemented in Pandas.

set more off
clear all
use http://www.stata-press.com/data/r15/auto2.dta
count if foreign == 1
# Counting a categorical. Map the category you want to count into a series.
# Then use `series.count()` to get non NaN instatnces.
exfile = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')
exfile['foreign'].map({1:1}).count()

4.2. Categorical Factor Variables

Description Stata Code Pandas Code
Load example data use http://www.stata-press.com/data/r15/hbp2.dta exfile = pd.read_stata('http://www.stata-press.com/data/r15/hbp2.dta')
One-way tabulation tab year
or
tab race
exfile['year'].value_counts()
or
exfile['race'].value_counts()
Two-way tabulation tab year race pd.crosstab(exfile['year'], exfile['race'])
Two-way tagulation with row option that normalizes by row tab year race, row pd.crosstab(exfile['year'], exfile['race']).apply(lambda r: r/r.sum(), axis=1)

Also Consider pd.pivot_table() Crosstab Or Pivot Table (In Pandas) Deciding When to Use Which
Two-way tabulation with col option that normalizes by column tab year race, col pd.crosstab(exfile['year'], exfile['race']).apply(lambda r: r/r.sum(), axis=0)
Three-way tabulation table year race sex pd.crosstab(exfile['year'], [exfile['sex'], exfile['race']])
Encode a categorical (That was originally string) encode sex, gen(sex_cat) exfile['sex_cat'] = exfile['sex'].astype('category')
then
exfile['sex_cat_code'] = exfile['sex_cat'].cat.codes
Create an array of dummies from categorical tab sex, gen(sex_) exfile = pd.get_dummies(exfile, columns=['sex'])
Collapse an array of dummies back to a single categorical StataList Post: https://www.statalist.org/forums/forum/general-stata-discussion/general/1413286-convert-dummy-variables-into-a-categorical-variable?p=1413305#post1413305 df[['sex_','sex_female','sex_male']].idxmax(axis='columns')

Also, SOF Post: https://stackoverflow.com/a/51275990/9572143

Practice tip for those transitioning from Stata to Python. Where Stata lets you reference rows and columns with the very human readable optional arguments row and col, Python wants an axis number. To make Python code more human readable, possibly easier to read it is an option to declare a row and a col variable. An example that builds on the two-way tabulation examples above.

>>> import pandas as pd
>>> exfile = pd.read_stata('http://www.stata-press.com/data/r15/hbp2.dta')
>>> row = 1; col = 0
>>> 
>>> # Now user row and col variables instead of axis index.
>>> pd.crosstab(exfile['year'], exfile['race']).apply(lambda r: r/r.sum(), axis=row)

race     White     Black  Hispanic
year                              
1988  0.104167  0.812500  0.083333
1989  0.077670  0.796117  0.126214
1990  0.195652  0.686957  0.117391
1991  0.207547  0.664151  0.128302
1992  0.200873  0.620087  0.179039
1993  0.147410  0.701195  0.151394

>>> pd.crosstab(exfile['year'], exfile['race']).apply(lambda r: r/r.sum(), axis=col)

race     White     Black  Hispanic
year                              
1988  0.025510  0.050453  0.025478
1989  0.040816  0.106080  0.082803
1990  0.229592  0.204398  0.171975
1991  0.280612  0.227684  0.216561
1992  0.234694  0.183700  0.261146
1993  0.188776  0.227684  0.242038

4.3. Merge Datasets

Description Stata Code Pandas Code
Load example data use http://www.stata-press.com/data/r15/autoexpense.dta
and
use http://www.stata-press.com/data/r15/autosize.dta
autoexp = pd.read_stata('http://www.stata-press.com/data/r15/autoexpense.dta')
and
autosiz = pd.read_stata('http://www.stata-press.com/data/r15/autosize.dta')
Merge autoexpense autosize (using make as the key variable) After loading autosize.dta
merge 1:1 make using http://www.stata-press.com/data/r15/autoexpense.dta
pd.merge(autoexp,autosiz, on='make', how='outer')

Mismatched Defaults. By default Stata performs what Pandas would refer to as an outer merge. Meaning "use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically." cite. More simply, the result will include all records from both datasets.

The default in Pandas performs an inner merge which means "use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys." cite. Again more simply, the result will only include records that matched in both datasets.

For Stata users looking to replicate Stata's behavior on a merge operation it is necessary to specify the how='outer' argument in the pd.merge() statement.

The missing ouput. By default Stata performs the merge operation while also adding a variable called _merge which indicates for each observation if that observation was from the master dataset (equivalent to the left side), the using dataset (equivalent to the right side), or both datasets. Additionally, Stata provides output which can help verify the merge operation was successful. To replicate Stata's behavior in Pandas users can add the indicator=True argument to the pd.merge() statement. This indicator argument then adds a variable that Pandas also calls _merge which can then be tabulated or cross-tabulated with other variables to assess merge results. A full example, using the above:

autoexp = pd.read_stata('http://www.stata-press.com/data/r15/autoexpense.dta')
autosiz = pd.read_stata('http://www.stata-press.com/data/r15/autosize.dta')
df = pd.merge(autoexp,autosiz, on='make', how='outer', indicator=True)
df['_merge'].value_counts()

both          5
right_only    1
left_only     0
Name: _merge, dtype: int64

Another example using mock data:

import random
data1 = {'var1':[1,2,3,4,5],
         'dat1':[random.randrange(10,99,1) for i in range(5)]}
data2 = {'var1':[2,3,4,5,6],
         'dat2':[random.randrange(10,99,1) for i in range(5)]}
df = pd.merge(pd.DataFrame(data1), pd.DataFrame(data2), on='var1', 
              how='outer', indicator=True)
df['_merge'].value_counts()

both          4
right_only    1
left_only     1
Name: _merge, dtype: int64

4.4. Append Datasets

Description Stata Code Pandas Code
Load example data use http://www.stata-press.com/data/r15/capop.dta
and
use http://www.stata-press.com/data/r15/txpop.dta
capop = pd.read_stata('http://www.stata-press.com/data/r15/capop.dta')
and
txpop = pd.read_stata('http://www.stata-press.com/data/r15/txpop.dta')
Append CA population with TX population After loading txpop.dta
append using http://www.stata-press.com/data/r15/capop.dta
pd.concat([capop,txpop])
or
pd.concat([capop,txpop]).reset_index()
Append and mark sources append using http://www.stata-press.com/data/r15/capop.dta, generate(source) pd.concat([capop,txpop],keys=['ca','tx'])

4.5. Reshape Datasets

Description Stata Code Pandas Code
Load example data use http://www.stata-press.com/data/r15/reshape1.dta exfile = pd.read_stata('http://www.stata-press.com/data/r15/reshape1.dta')
Reshape from wide to long reshape long inc ue, i(id) j(year) exfile = pd.wide_to_long(exfile, stubnames=['inc','ue'], i=['id','sex'], j='year')
for more Stata-like then:
exfile = exfile.reset_index(0).reset_index(0).reset_index(0)
Reshape long to wide reshape wide inc ue, i(id) j(year) Quick version:
exfile2 = exfile.pivot_table(values=['sex','inc','ue'], columns='year', index='id')

While Pandas provides wide_to_long option, it does not provide a long_to_wide option. Below is code that will produce a long to wide reshape more consistent with Stata's results.

# Load example data that in long format.
exfile = pd.read_stata('http://www.stata-press.com/data/r15/reshape6.dta')

# Perpare wide dataframes for each variable that changes over j.
exfile['inc_idx'] = 'inc' + exfile.year.astype(str)
inc = exfile.pivot(index='id',columns='inc_idx',values='inc')

exfile['ue_idx'] = 'ue' + exfile.year.astype(str)
ue = exfile.pivot(index='id',columns='ue_idx',values='ue')

# Concatenate / Append individual wide datasets.
exfile2 = pd.concat([inc,ue],axis=1).reset_index()

# Gather values for varaibles that do not change over j.
exfile_sex = pd.DataFrame(exfile[['id','sex']])
exfile_sex = exfile_sex.pivot_table(index='id', values='sex').reset_index()

# Merge variables that do not change over j.
exfile3 = pd.merge(exfile_sex, exfile2, on='id')
exfile3

Another example from Dean McGrath who writes Pivoting A Pandas DataFrame at Towards Data Science. Version with adjustments:

import pandas as pd

# Example Long Data
df = pd.DataFrame(data={'staff_no': [9999] * 5,
                        'name': ['Dean'] * 5,
                        'year': [2016, 2017, 2018, 2019, 2020],
                        'hours': [349, 231, 876, 679, 976]})

# Convert Long To Wide
df = df.pivot(index='name', columns='year', values='hours').reset_index()

A wide to long example based on an article from Soner Yildirim who writes 8 Ways to Transform Pandas Dataframes at Towards Data Science.

import pandas as pd

# Example Long Data
df = pd.DataFrame(data={'staff_no': [9999] * 5,
                        'name': ['Dean'] * 5,
                        'year': [2016, 2017, 2018, 2019, 2020],
                        'hours': [349, 231, 876, 679, 976]})

# Convert Long To Wide
df = df.pivot(index='name', columns='year', values='hours').reset_index()

# Convert Wide To Long
pd.melt(df, id_vars='name').head()

4.6. Loops

Foreach Loops

Stata Version

use http://www.stata-press.com/data/r15/auto2.dta

foreach var in price mpg weight length {
    sum `var'
    gen z`var' = (`var' - r(mean)) / r(sd)
}

list price mpg weight length zprice zmpg zweight zlength in 1/5

Python Version

import pandas as pd
from scipy.stats import zscore

exfile = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')

for var in ['price','mpg','weight','length']:
    exfile['z{}'.format(var)] = exfile[[var]].apply(zscore)

exfile.head()

Forvalues Loops

Stata Version

clear all

input str10 Observer Day1 Day2 Day3
   "Adam" 3 7 8
   "Ken"  6 4 6
   "Zita" 7 6 4
   "Sam"  4 6 2
end

forvalues i = 1/3 {
   sum Day`i'
   gen zDay`i' = (Day`i' - r(mean)) / r(sd)
}

list

Python Version

import pandas as pd
from scipy.stats import zscore

df = pd.DataFrame({'Observer':['Adam','Ken','Zita','Ari','Sam'],
                  'Day1':[3,6,7,8,4],
                  'Day2':[7,4,6,5,6],
                  'Day3':[8,6,4,3,2]})

for i in range(1,4):
    df['zDay{}'.format(i)] = df[['Day{}'.format(i)]].apply(zscore)

df

4.7. Specialty Loops

Stata Version - Loop Through Files In Directory

// Get list of files that are csv
local files : dir . files "*.csv"

// Loop through file list and save each one in Stata .dta
foreach file in `files' {
    import delimited using "`file'"
    save "`file'.dta", replace
    clear
}

Python Version - Loop Through Files In Directory

import os
import pandas as pd

# Loop through files that are csv and save each in Stata .dta
for filename in os.listdir():
    if filename[-4:] == '.csv':
        pd.read_csv(filename).to_stata('{}.dta'.format(filename[:-4]))

Stata Version - Loop Through Levels Of A Categorical

sysuse auto

// Use the repair record categorical as an example
levelsof rep78, local(levels) 
foreach lev of local levels {
    // Preform any function specifc to each lev (summary stats for price)
    sum price if rep78 == `lev'
}

Python Version - Loop Through Levels Of A Categorical

import pandas as pd

df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta') 

# Use the repair record categorical as an example
for lev in df['rep78'].unique():
    # Preform any function specifc to each lev (summary stats for price)
    print(f'\n\nDisplay results for repair record {lev}')
    df[df['rep78'] == lev]['price'].describe()

5. Exporting Pandas data to Stata

5.1. Problems With Unicode

A problem that happens when saving to Stata is that pandas.DataFrame.to_stata sometimes writes unicode characters even though the format used by pandas.DataFrame.to_stata does not support unicode. Documented on this issue. A simplistc explanation is that unicode chracters can throw off the expected chracter count in Stata data. A crude solution is to make sure each chracter is only one chracter space:

# Define function that finds and replaces offensive characters.
def fix_char_ct(bad_text):
    ret_txt = ''
    for item in bad_text:
        ret_txt += item if len(item.encode(encoding='utf_8')) == 1 else ''
    return(ret_txt)

# Use apply to clean problematic text.
df['Problematic_Txt'] = df['Problematic_Txt'].apply(fix_char_ct)

5.2. Converting Object Data Types

Another frequent problem is that pandas.DataFrame.to_stata seems to have trouble writing the object data type. A solution to this trouble is:

# Define function that finds object data types, converts to string.
def obj_to_string(df):
    for obj_col in list(df.select_dtypes(include=['object']).columns):
        df[obj_col] = df[obj_col].astype(str)
    return(df)

# Pass dataframe with object data types to function.
df = obj_to_string(df)

5.3. Handling Column Names

Acceptable variable names in Stata is more limited than those in Pandas. To help pandas.DataFrame.to_stata is able to make corrections. However, sometimes the default corrections might not be preferred. A solution is to rename columns before writing to Stata:

# This function cleans a string so that only letters a-z and digits 0-9  remain. 
# Also removes spaces. Optional case argument controls variable name character case.
def clean_word(word, *, case='lower'):
    import re
    if case == 'lower':
        return(''.join(re.findall(r'[a-z|A-Z|0-9]', word.lower())))
    elif case == 'upper':
        return(''.join(re.findall(r'[a-z|A-Z|0-9]', word.upper())))
    elif case == 'asis':
        return(''.join(re.findall(r'[a-z|A-Z|0-9]', word)))
    else:
        raise Exception('Argument (case) incorrectly specified. \
                        Default is "lower" Alternate options \
                        are "upper" and "asis".')

# This funciton cleans list of column names so that only letters a-z and digits 0-9 
# remain. Also removes spaces. Makes sure each column name is unique. If duplicats
# present will print warning with explanation.
def clean_cols(clst, *, case='lower'):
    import warnings
    newcols = []
    for col in clst:
        newcols.append(clean_word(col, case=case))
    if len(clst) != len(set(newcols)):
        warnings.warn('\nDuplicates in column list. \
                      \nDuplicates appended with location.')
        newestcols = []
        suffix = 0
        for i in newcols:
            if newcols.count(i) > 1:
                newestcols.append(i + str(suffix))
            else:
                newestcols.append(i)
            suffix += 1
        return(newestcols)
    else:
        return(newcols)

# Using the above functions.
df.columns = clean_cols(df.columns)

# or

df.columns = clean_cols(df.columns, case='upper')

# or 

df.columns = clean_cols(df.columns, case='asis')

5.4 Writing Variable Labels

See here for a demonstration of writing Variable Labels.

6. Also useful

6.1. Summary Statistics

Quickly display table (a Pandas DataFrame) that lists variables, variable descriptions (variable labels), and summary statistics.

import pandas as pd

auto = 'http://www.stata-press.com/data/r15/auto.dta'
reader = pd.io.stata.StataReader(auto)
exfile = pd.read_stata(auto)

pd.merge(pd.DataFrame(reader.variable_labels(), index=['Label']).transpose(), 
         exfile.describe(include='all').transpose(), 
         left_index=True, 
         right_index=True)
Label count unique top freq mean std min 25% 50% 75% max
make Make and Model 74 74 Pont. Le Mans 1 NaN NaN NaN NaN NaN NaN NaN
price Price 74 NaN NaN NaN 6165.26 2949.5 3291 4220.25 5006.5 6332.25 15906
mpg Mileage (mpg) 74 NaN NaN NaN 21.2973 5.7855 12 18 20 24.75 41
rep78 Repair Record 1978 69 NaN NaN NaN 3.4058 0.989932 1 3 3 4 5
headroom Headroom (in.) 74 NaN NaN NaN 2.99324 0.845995 1.5 2.5 3 3.5 5
trunk Trunk space (cu. ft.) 74 NaN NaN NaN 13.7568 4.2774 5 10.25 14 16.75 23
weight Weight (lbs.) 74 NaN NaN NaN 3019.46 777.194 1760 2250 3190 3600 4840
length Length (in.) 74 NaN NaN NaN 187.932 22.2663 142 170 192.5 203.75 233
turn Turn Circle (ft.) 74 NaN NaN NaN 39.6486 4.39935 31 36 40 43 51
displacement Displacement (cu. in.) 74 NaN NaN NaN 197.297 91.8372 79 119 196 245.25 425
gear_ratio Gear Ratio 74 NaN NaN NaN 3.01486 0.456287 2.19 2.73 2.955 3.3525 3.89
foreign Car type 74 2 Domestic 52 NaN NaN NaN NaN NaN NaN NaN

6.2. Advanced Summary Stats With Pandas Profiling

This is an implementation of 10 Simple Hacks To Speed... Data Analysis... but with Stata's auto.dta. The article's implemenation was pre version 2.0.0. The imlemenation below is 2.0.0+ syntax.

pip install pandas-profiling

# or

conda install -c anaconda pandas-profiling
import pandas as pd
import pandas_profiling

df = pd.read_stata('http://www.stata-press.com/data/r15/auto2.dta')

profile = df.profile_report()
rejected = profile.get_rejected_variables()
df.profile_report(title='Stata Auto.dta Pandas Profiled',
                  correlation_overrides=[rejected])
profile.to_file(output_file='Stata_Auto.dta_Profile.html')

The Stata_Auto.dta_Profile.html output is available here.

6.3. External Resouces

7. Questions, Comments, & Contributions

Send me your questions, comments, contributions, and tell me what I did wrong.

Fork and pull requests welcome.