<h1>SAS to Pandas Dictionary</h1><h3> A guide to using Python from a SAS background. Part 1: getting and summarizing data </h3> 
<span>&ensp;&ensp; The easiest way to get Python with all the needed libraries and development environments is to download <a href="https://docs.continuum.io/anaconda/install">ANACONDA</a>. Due to IT constraints the author had to do this analysis in Python 2.7. The way of the future, and the present, is Python 3. It is no longer debatable: new code should be developed in the latest version of Python 3. Fortunately, for most data scientists and risk analysts, the differences between 2.7 and 3 are small. 

&ensp;&ensp; These examples are run in the Jupyter development environment provided with Anaconda. This environment allows you to create a document with executable code. You can display code, tables and graphs along with formatted text, so anyone (including you) can recreate your results. The environment is easy to use. An internet search will yield tutorials on using Jupyter to present your work and create interactive documents.
</span>

**Getting started** The best Python library to analyze data is Pandas. It is optimized to execute quickly and provides easy commands. The syntax for importing it is below. The standard is to give it the alias pd. The second line imports display, which is used to show output in the document.

In [1]:
import pandas as pd
from IPython.display import display

Let's start by creating a simple data set to explore

### SAS data step with CARDS =  Pandas DataFrame
&ensp;&ensp;In SAS, manually creating data is typically done with the cards or datalines statement. In Pandas you use the DataFrame constructor. This data set is used in the early examples. For an explanation of loading files see "Loading Data" later in this document. 


**SAS**

    DATA sasdat;  
        INPUT segment $ revenue  loss ;  
        DATALINES;  
    subprime  5 -1  
    midprime  4  0  
    prime     3  0  
    midprime  5  -2  
    prime     4  0  
    midprime  5 0  
    subprime  6  0  
    ;  
   
    PROC PRINT; RUN;  

**PANDAS**  
Data sets in Pandas are referred to as data frames. Use the DataFrame constructor to create a new data frame.

In [2]:
df=pd.DataFrame({'segment': ['subprime', 'midprime', 'prime', 'midprime', 'prime', 'midprime', 'subprime']
                ,'revenue': [5,4,3,5,4,5,6]
                ,'loss': [-1, 0, 0, -2, 0, 0, 0]})

In the above code you create a data frame called df that has three variables: segment, revenue and loss. The values associated with each variable are in the square brackets. To view the data frame, either type its name or pass it to display:

In [3]:
display(df)

,loss,revenue,segment
0,-1,5,subprime
1,0,4,midprime
2,0,3,prime
3,-2,5,midprime
4,0,4,prime
5,0,5,midprime
6,0,6,subprime
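If you prefer to keep the data laid out the way it appears under DATALINES, one option is to read it from a block of text. This is only a sketch: the names raw and df_from_text are made up for this illustration, and the header row is added by hand because the text has no INPUT statement to name the variables.

```python
from io import StringIO

import pandas as pd

# Same rows as the DATALINES block, one observation per line.
# The first line plays the role of the INPUT statement (variable names).
raw = StringIO("""segment revenue loss
subprime 5 -1
midprime 4 0
prime 3 0
midprime 5 -2
prime 4 0
midprime 5 0
subprime 6 0
""")

# sep=r"\s+" splits on runs of whitespace, much like SAS list input.
df_from_text = pd.read_csv(raw, sep=r"\s+")
print(df_from_text)
```

The column order here follows the text, so segment comes first.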


Add some variables:

In [4]:
df['debt']=['high','low', 'high', 'low', 'low', 'low', 'low']
df['sales']=['low','medium', 'high', 'medium', 'medium', 'low', 'low']

### PROC CONTENTS  = columns and shape 
&ensp;&ensp;proc contents provides two key pieces of information: variable names and number of observations.

&ensp;&ensp;To get the variable names in Pandas call the columns attribute. To get the number of observations use the shape attribute.

**SAS**

    proc contents data=sasdat; 
    run;

**PANDAS**  
To get the variable names just call the data frame's attribute "columns"

In [5]:
df.columns

Index([u'loss', u'revenue', u'segment', u'debt', u'sales'], dtype='object')

To get the number of observations use the shape attribute. The first value is the number of observations. The second is the number of variables

In [6]:
df.shape

(7, 5)

In [7]:
obs, columns= df.shape
print('the number of observations are ' + str(obs))
print('the number of columns are ' + str(columns))

the number of observations are 7
the number of columns are 5
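proc contents also reports the type of each variable. A rough Pandas counterpart, sketched here on a rebuilt copy of the first three variables, is the dtypes attribute together with the info method. How text columns are labelled (object or a string type) depends on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({'segment': ['subprime', 'midprime', 'prime', 'midprime',
                               'prime', 'midprime', 'subprime'],
                   'revenue': [5, 4, 3, 5, 4, 5, 6],
                   'loss': [-1, 0, 0, -2, 0, 0, 0]})

# dtypes lists each variable with its storage type.
print(df.dtypes)

# info() prints the number of rows, the non-missing count per column
# and the memory used by the data frame.
df.info()
```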


### PROC PRINT = head

**SAS**

    proc print data=sasdat (obs=2);
    var segment;
    run;

**PANDAS**  
There are a couple of ways to view particular values in pandas. One way is to use the head method

In [8]:
df.segment.head(n=2)

0    subprime
1    midprime
Name: segment, dtype: object
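Another way to view particular values is to select the columns first and, if needed, filter the rows. The sketch below rebuilds the first three variables so it runs on its own; first_two and prime_rows are names chosen just for this example.

```python
import pandas as pd

df = pd.DataFrame({'segment': ['subprime', 'midprime', 'prime', 'midprime',
                               'prime', 'midprime', 'subprime'],
                   'revenue': [5, 4, 3, 5, 4, 5, 6],
                   'loss': [-1, 0, 0, -2, 0, 0, 0]})

# Like "var segment revenue;" with (obs=2): pick columns, then the first two rows.
first_two = df[['segment', 'revenue']].head(2)
print(first_two)

# Like adding "where segment = 'prime';": keep only the rows meeting a condition.
prime_rows = df.loc[df['segment'] == 'prime', ['segment', 'loss']]
print(prime_rows)
```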

### PROC Summary = groupby

**SAS**

    proc summary data=sasdat;
        class segment;
        var revenue loss;
        output out=newdat sum=;
    run;


**PANDAS**  
To get descriptive statistics by class or segment, use the groupby method followed by an aggregation such as mean, sum or count.

In [9]:
newdat=df.groupby('segment', as_index=False)[['revenue', 'loss']].sum()
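When several statistics are wanted in one pass, as when proc summary is given more than one keyword, groupby can be combined with agg. This is a sketch on a rebuilt copy of the data; the output names total_revenue, total_loss and n are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'segment': ['subprime', 'midprime', 'prime', 'midprime',
                               'prime', 'midprime', 'subprime'],
                   'revenue': [5, 4, 3, 5, 4, 5, 6],
                   'loss': [-1, 0, 0, -2, 0, 0, 0]})

# Each keyword is an output column: (input column, statistic to apply).
summary = df.groupby('segment').agg(
    total_revenue=('revenue', 'sum'),
    total_loss=('loss', 'sum'),
    n=('loss', 'size'),
)
print(summary)
```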

In [10]:
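df['loss'] = df['loss'].fillna(0)

    Missing values in loss are replaced with 0 and the result is assigned back to the column. The shorter form
        df['loss'].fillna(0, inplace=True)
    is often seen, but it acts on a selected column and recent pandas versions may not write the change back to df.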

### PROC FREQ = groupby or unique

**SAS**       

    proc freq data=sasdat;
        tables segment;
    run;

**PANDAS**

In [11]:
df.groupby('segment').count()

,loss,revenue,debt,sales
segment,,,,
midprime,3,3,3,3
prime,2,2,2,2
subprime,2,2,2,2


**or**

In [12]:
df['segment'].unique()

array(['subprime', 'midprime', 'prime'], dtype=object)
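Note that unique only lists the distinct values. For something closer to the frequency and percent columns of proc freq, the value_counts method can be used. A minimal sketch, with counts, percents and freq as made-up names:

```python
import pandas as pd

df = pd.DataFrame({'segment': ['subprime', 'midprime', 'prime', 'midprime',
                               'prime', 'midprime', 'subprime']})

# Number of observations in each segment.
counts = df['segment'].value_counts()

# normalize=True returns proportions; multiply by 100 for percentages.
percents = df['segment'].value_counts(normalize=True) * 100

# Put both side by side, similar to the proc freq listing.
freq = pd.DataFrame({'frequency': counts, 'percent': percents})
print(freq)
```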

### PROC FREQ multiple variables = pivot_table

**SAS**

    proc freq data= sasdat; 
        tables segment*debt;
    run;

**PANDAS**  
This example uses the pivot_table method to get the volume in each segment by debt classification.

In [13]:
df.pivot_table(values='loss', index='segment', columns='debt', margins=True, aggfunc='count')

debt,high,low,All
segment,,,
midprime,,3.0,3.0
prime,1.0,1.0,2.0
subprime,1.0,1.0,2.0
All,2.0,5.0,7.0
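pandas also has a crosstab function that counts combinations directly, without naming a values column. The sketch below rebuilds segment and debt so it is self-contained; table and row_pct are names chosen for this example. Unlike the pivot table above, empty cells come back as 0 rather than missing.

```python
import pandas as pd

df = pd.DataFrame({'segment': ['subprime', 'midprime', 'prime', 'midprime',
                               'prime', 'midprime', 'subprime'],
                   'debt': ['high', 'low', 'high', 'low', 'low', 'low', 'low']})

# Counts of segment by debt, with row and column totals.
table = pd.crosstab(df['segment'], df['debt'], margins=True)
print(table)

# normalize='index' gives row proportions, similar to the row percent in proc freq.
row_pct = pd.crosstab(df['segment'], df['debt'], normalize='index')
print(row_pct)
```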


### PROC MEANS = describe

**SAS**

    proc means data=sasdat;
        var loss revenue;
    run;

**PANDAS** 

To get basic statistics in pandas use the describe method.

In [14]:
df.describe()

,loss,revenue
count,7.0,7.0
mean,-0.428571,4.571429
std,0.786796,0.9759
min,-2.0,3.0
25%,-0.5,4.0
50%,0.0,5.0
75%,0.0,5.0
max,0.0,6.0


You can also pick just the variables you want to describe. The code below returns the summary statistics for the variable loss only.

In [15]:
df.loss.describe()

count    7.000000
mean    -0.428571
std      0.786796
min     -2.000000
25%     -0.500000
50%      0.000000
75%      0.000000
max      0.000000
Name: loss, dtype: float64
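describe always returns the same fixed set of statistics. When only certain ones are wanted, as when keywords such as sum or max are listed on the proc means statement, one option is the agg method. A sketch on a rebuilt copy of the data, with stats as a made-up name:

```python
import pandas as pd

df = pd.DataFrame({'revenue': [5, 4, 3, 5, 4, 5, 6],
                   'loss': [-1, 0, 0, -2, 0, 0, 0]})

# One row per requested statistic, one column per variable.
stats = df[['loss', 'revenue']].agg(['sum', 'mean', 'min', 'max'])
print(stats)
```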

### PROC MEANS several variables = pivot_table or groupby

**SAS**

    proc means data=sasdat sum;
        class segment debt;
        var loss revenue;
    run;

**PANDAS**  
Get the sum by the class variables.

In [16]:
df.pivot_table(values='loss', index='segment', columns='debt', margins=True, aggfunc='sum')

debt,high,low,All
segment,,,
midprime,,-2.0,-2.0
prime,0.0,0.0,0.0
subprime,-1.0,0.0,-1.0
All,-1.0,-2.0,-3.0


values is the analysis variable, here set to loss. The rows are given by index, this time set to the variable segment. The columns are given by columns, set to debt. Setting margins=True adds row and column totals. Finally, we want to sum loss, so the aggregation function is set with aggfunc='sum'.

If aggfunc is omitted, pivot_table returns the mean, which is its default:

In [17]:
df.pivot_table(values='loss', index='segment', columns='debt', margins=True)

debt,high,low,All
segment,,,
midprime,,-0.666667,-0.666667
prime,0.0,0.0,0.0
subprime,-1.0,0.0,-0.5
All,-0.5,-0.4,-0.428571


The max of two variables across two column classifications:

In [18]:
df.pivot_table(values=['loss','revenue'], index='segment', columns=['debt','sales'], margins=True, aggfunc='max')

,loss,loss,loss,loss,loss,revenue,revenue,revenue,revenue,revenue
debt,high,high,low,low,All,high,high,low,low,All
sales,high,low,low,medium,,high,low,low,medium,
segment,,,,,,,,,,
midprime,,,0.0,0.0,0.0,,,5.0,5.0,5.0
prime,0.0,,,0.0,0.0,3.0,,,4.0,4.0
subprime,,-1.0,0.0,,0.0,,5.0,6.0,,6.0
All,0.0,-1.0,0.0,0.0,0.0,3.0,5.0,6.0,5.0,6.0


groupby will yield results in a format similar to proc means.

In [19]:
df.groupby(['segment','debt', 'sales']).sum()

,,,loss,revenue
segment,debt,sales,,
midprime,low,low,0,5
midprime,low,medium,-2,9
prime,high,high,0,3
prime,low,medium,0,4
subprime,high,low,-1,5
subprime,low,low,0,6
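In the result above the class variables sit in the index rather than in ordinary columns. To get a flat table, closer to the data set that an output statement would write, you can call reset_index. A sketch on a rebuilt copy of the data, where summary and flat are names chosen for this example:

```python
import pandas as pd

df = pd.DataFrame({'segment': ['subprime', 'midprime', 'prime', 'midprime',
                               'prime', 'midprime', 'subprime'],
                   'revenue': [5, 4, 3, 5, 4, 5, 6],
                   'loss': [-1, 0, 0, -2, 0, 0, 0],
                   'debt': ['high', 'low', 'high', 'low', 'low', 'low', 'low'],
                   'sales': ['low', 'medium', 'high', 'medium', 'medium', 'low', 'low']})

# Sum the numeric variables within each combination of the class variables.
summary = df.groupby(['segment', 'debt', 'sales'])[['loss', 'revenue']].sum()

# reset_index moves segment, debt and sales from the index back into columns.
flat = summary.reset_index()
print(flat)
```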


# Loading Data
<h4> SAS DATA STEP=Pandas read\_sas  </h4>
<span><h4>Proc IMPORT=read_xxxx</h4>

&ensp;&ensp;To load different file formats (.sas7bdat, .csv, .xlsx, .json, etc.) into data frames, use the matching pandas reader: read_sas, read_csv, read_excel, read_json and so on.

&ensp;&ensp;The following code downloads the data from the internet to the computer. This code just fetches the data that will be used in the examples. </span>


In [20]:
## getting an example of sas7bdat data and csv data. 
try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2.7

# urlretrieve saves each file locally and returns (local path, response headers)
urlretrieve("http://www.principlesofeconometrics.com/sas/hhsurvey.sas7bdat", "/tmp/sas_data.sas7bdat")
urlretrieve("http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv", "/tmp/csv_data.csv")

('/tmp/csv_data.csv', <httplib.HTTPMessage instance at 0x7fbbab29d7a0>)

<h3> SAS DATA STEP=Pandas read\_sas  </h3>

**SAS**  

    LIBNAME source '/tmp/';
    
    DATA newdata;
      set source.sas_data;
    run;

**PANDAS**

Since the file is in sas7bdat format you use the read_sas function. This function also supports the XPORT format.

In [21]:
dataframe_from_sas=pd.read_sas( "/tmp/sas_data.sas7bdat")

Use the head method to check that the data is loaded:

In [22]:
dataframe_from_sas.head()

,A,ALCOH,FOOD,K,TRPORT,X
0,3.0,8.99,157.050003,0.0,80.510002,692.0
1,2.000358,17.75,70.779999,0.0,40.720001,272.0
2,2.000006,2.97,177.199997,0.0,29.309999,1130.0
3,2.0,13.5,75.110001,2.0,38.110001,535.0
4,2.0,47.41,147.889999,0.0,108.269997,767.0


<h3> SAS PROC IMPORT = Pandas read\_xxxx  </h3>

**SAS**

    proc import datafile="/tmp/csv_data.csv" out=newdata dbms=csv replace;
        getnames=yes;
    run;

**PANDAS**

    pandas can read many formats into data frames. Examples of functions that read data:
       read_csv
       read_sas
       read_excel
       read_clipboard
       read_stata
       read_json
&ensp;&ensp;&ensp;   For a complete list of functions for importing data, refer to the <a href="http://pandas.pydata.org/pandas-docs/stable/api.html#input-output">DOCUMENTATION</a>

    Below is an example with csv data

In [23]:
dataframe_from_csv=pd.read_csv("/tmp/csv_data.csv")
dataframe_from_csv.head()

,street,city,zip,state,beds,baths,sq__ft,type,sale_date,price,latitude,longitude
0,3526 HIGH ST,SACRAMENTO,95838,CA,2,1,836,Residential,Wed May 21 00:00:00 EDT 2008,59222,38.631913,-121.434879
1,51 OMAHA CT,SACRAMENTO,95823,CA,3,1,1167,Residential,Wed May 21 00:00:00 EDT 2008,68212,38.478902,-121.431028
2,2796 BRANCH ST,SACRAMENTO,95815,CA,2,1,796,Residential,Wed May 21 00:00:00 EDT 2008,68880,38.618305,-121.443839
3,2805 JANETTE WAY,SACRAMENTO,95815,CA,2,1,852,Residential,Wed May 21 00:00:00 EDT 2008,69307,38.616835,-121.439146
4,6001 MCMAHON DR,SACRAMENTO,95824,CA,2,1,797,Residential,Wed May 21 00:00:00 EDT 2008,81900,38.51947,-121.435768
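read_csv takes options that cover some of what you might otherwise do with keep= or an informat in SAS. The sketch below reads a few rows typed in by hand (copied from the output above, with some columns left out) instead of the downloaded file, so that it runs without an internet connection. The names sample and subset are made up for this example.

```python
from io import StringIO

import pandas as pd

# A few rows in the same layout as the csv file, held in memory.
sample = StringIO("""street,city,zip,state,beds,baths,sq__ft,type,price
3526 HIGH ST,SACRAMENTO,95838,CA,2,1,836,Residential,59222
51 OMAHA CT,SACRAMENTO,95823,CA,3,1,1167,Residential,68212
2796 BRANCH ST,SACRAMENTO,95815,CA,2,1,796,Residential,68880
""")

subset = pd.read_csv(
    sample,
    usecols=['city', 'zip', 'beds', 'price'],  # keep only these columns
    dtype={'zip': str},                        # read zip codes as text, not numbers
    nrows=2,                                   # stop after two data rows
)
print(subset)
```

Reading zip as text keeps any leading zeros, which would be lost if the column were read as a number.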
