# Demo: Analyzing Categorical Columns

### 1. Import Packages and Connect to the CAS Server

See the documentation for the [SWAT (SAS Scripting Wrapper for Analytics Transfer)](https://sassoftware.github.io/python-swat/index.html) package.

In [None]:
## Import packages
import swat
import pandas as pd
import matplotlib.pyplot as plt
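## The 'seaborn' style was renamed 'seaborn-v0_8' in Matplotlib 3.6; use whichever name is available
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')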

## Set options
pd.set_option('display.max_columns', None)

## Connect to CAS
conn = swat.CAS('server.demo.sas.com', 30571, 'student', 'Metadata0', name = 'py03d03')

## Function to load the loans_raw.sashdat file into memory if necessary
def loadLoans():
    conn.loadTable(path ='loans_raw.sashdat', caslib = 'PIVY',
                   casOut = {'name' : 'loans_raw',
                            'caslib' : 'casuser',
                            'promote' : True})

### 2. Explore Available CAS Tables

a. Use the tableInfo action to view all available in-memory tables in the **Casuser** caslib. If the **LOANS_RAW** CAS table is not available, uncomment the call to the loadLoans function and execute the cell.

In [None]:
#loadLoans()
conn.tableInfo(caslib = 'casuser')
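
Instead of reading the tableInfo output by eye, the availability check can be done in code. The following is a minimal sketch, assuming that the tableInfo result holds a table with a *Name* column of table names. The `table_is_loaded` helper is hypothetical, and `table_info` is a made-up stand-in **DataFrame**, not real server output.

```python
import pandas as pd

def table_is_loaded(table_info: pd.DataFrame, name: str) -> bool:
    """Return True if `name` appears in the Name column (case-insensitive)."""
    # An empty caslib may produce no table at all, so guard against that case
    if table_info.empty or 'Name' not in table_info.columns:
        return False
    return name.upper() in table_info['Name'].str.upper().tolist()

## Made-up stand-in for the table returned by the tableInfo action
table_info = pd.DataFrame({'Name': ['LOANS_RAW', 'CARS'],
                           'Rows': [100, 428]})

print(table_is_loaded(table_info, 'loans_raw'))
print(table_is_loaded(table_info, 'customers'))
```

With a live session, the same helper could decide whether loadLoans needs to run, which avoids an error from promoting a table that already exists.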

b. Reference the **LOANS_RAW** CAS table using the CASTable method, and preview the table using the head method.

In [None]:
tbl = conn.CASTable('loans_raw', caslib = 'casuser')
tbl.head()

### 3. Frequency Distribution Using the value_counts Method

a. Use the value_counts method on the **Category** column to view the relative frequency of each category, and store the results in the variable **vc_df**. Specify normalize = *True* so that the method returns proportions (values between 0 and 1) instead of counts. Display the type and value of the **vc_df** object. The CAS server summarizes the data and returns a **Series** to the client.

In [None]:
vc_df = (tbl
         .Category
         .value_counts(normalize = True))

## Display the object type and view the results
display(type(vc_df), vc_df)
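
The effect of the normalize option is easiest to see on a small example. The sketch below uses the Pandas value_counts method on a made-up client-side **Series** (the category values are invented for illustration) to compare raw counts with normalized proportions.

```python
import pandas as pd

## Made-up stand-in for the Category column
category = pd.Series(['Car', 'Car', 'Mortgage', 'Credit Card'], name='Category')

## Raw frequencies
counts = category.value_counts()

## Proportions between 0 and 1 that sum to 1
shares = category.value_counts(normalize=True)

print(counts)
print(shares)

## Multiply by 100 to express the proportions as percentages
print((shares * 100).round(1))
```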

b. With a **Series** object returned from CAS to the client, you can use the traditional Pandas plot method to visualize the summarized results.

In [None]:
## normalize = True returns proportions (0 to 1), so the title says Proportion
vc_df.plot(kind = 'bar', figsize = (8,6), title = 'Proportion of Loans by Category');

### 4. Frequency Distribution Using the freq Action

a. You can use the [simple.freq](https://documentation.sas.com/doc/en/pgmsascdc/v_018/casanpg/cas-simple-freq.htm) action to obtain the frequency distribution of multiple columns. The freq action returns a **CASResults** object with a single **SASDataFrame**. Here, the **Category** and **LoanGrade** columns are specified in the inputs parameter.

In [None]:
freq_results = tbl.freq(inputs = ['Category','LoanGrade'])
freq_results

b. With a **CASResults** object on the client, you can reference the *Frequency* key to store the **SASDataFrame** in a variable named **freq_df**. Then confirm the type and value of the **freq_df** variable. Notice that it's a **SASDataFrame** with the frequency values for both the **Category** and **LoanGrade** columns.

In [None]:
freq_df = freq_results['Frequency']

## Display the object type and view the results
display(type(freq_df), freq_df)

c. Once you have the **SASDataFrame** on the client, you can use the Pandas package to visualize the summarized results. Here, create a separate **DataFrame** for each input column, and then visualize each one. All code below is Pandas code, and it's processed on the client because you are working with a **SASDataFrame**.

In [None]:
## DataFrame of the Category rows
categorydf = (freq_df
              .query('Column == "Category"')
              .sort_values('Frequency', ascending = False))

## DataFrame of the LoanGrade rows
loanGradedf = (freq_df
               .query('Column == "LoanGrade"')
               .sort_values('Frequency', ascending = False))

## Display the DataFrames
display(categorydf, loanGradedf)


##
## Plot the SASDataFrames
##
fig, (ax1, ax2) = plt.subplots(ncols = 2, figsize = (16,6))

## ax1
categorydf.plot(kind = 'bar', x = 'CharVar', y = 'Frequency', 
                ax = ax1, title = "Frequency of Category", xlabel = '')

## ax2
loanGradedf.plot(kind = 'bar', x = 'CharVar', y = 'Frequency', 
                 ax = ax2, title = "Frequency of Loan Grade",  xlabel = '');
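
When more input columns are involved, writing one query per column becomes repetitive. The sketch below shows one possible alternative with the Pandas groupby method. It uses a made-up **DataFrame** that borrows the *Column*, *CharVar*, and *Frequency* column names from the code above; the values are invented.

```python
import pandas as pd

## Made-up stand-in for the frequency table returned by the freq action
freq_df = pd.DataFrame({
    'Column':    ['Category', 'Category', 'LoanGrade', 'LoanGrade', 'LoanGrade'],
    'CharVar':   ['Car', 'Mortgage', 'A', 'B', 'C'],
    'Frequency': [30, 70, 50, 25, 25],
})

## One sorted DataFrame per input column, keyed by the column name
tables = {name: grp.sort_values('Frequency', ascending=False)
          for name, grp in freq_df.groupby('Column')}

## Percentage of each level within its own input column
freq_df['Percent'] = (freq_df['Frequency']
                      / freq_df.groupby('Column')['Frequency'].transform('sum')
                      * 100)

print(tables['Category'])
print(freq_df)
```

The dictionary keeps the per-column tables together, so a plotting loop can iterate over `tables.items()` instead of naming each variable.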

d. You can also create a calculated column for ad hoc analysis by setting the computedVarsProgram parameter on the **CASTable** object. Here, a new column named **InterestCat** is created with the SAS IFC function, which returns one of two character values based on a condition. If **InterestRate** is *0*, the value is *No Interest*. Otherwise, the value is *Interest*. Then use the **tbl** object with the freq action, and specify the new **InterestCat** column in the inputs parameter.

In [None]:
## Create a new column
tbl.computedVarsProgram = "InterestCat = ifc(InterestRate = 0, 'No Interest', 'Interest');"
display(tbl)

## Analyze the new column
cr_freq = tbl.freq(inputs='InterestCat')
display(cr_freq)

## Delete the computedVarsProgram parameter
del tbl.computedVarsProgram
display(tbl)
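
For comparison, the same two-way condition can be written in Pandas on the client. This is only an illustrative analog of the IFC logic on a made-up **DataFrame**, not what the CAS server executes.

```python
import pandas as pd

## Made-up stand-in for the InterestRate column
df = pd.DataFrame({'InterestRate': [0.0, 5.5, 0.0, 12.9]})

## Equivalent of ifc(InterestRate = 0, 'No Interest', 'Interest')
df['InterestCat'] = (df['InterestRate']
                     .eq(0)
                     .map({True: 'No Interest', False: 'Interest'}))

print(df)
print(df['InterestCat'].value_counts())
```

The difference is where the work happens: the computed column in CAS is evaluated on the server while the action runs, whereas this version requires the rows to be on the client.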

### 5. Frequency Distribution Using the freqTab Action

a. The [freqTab.freqTab](https://go.documentation.sas.com/doc/en/pgmsascdc/v_017/casactstat/cas-freqtab-TblOfActions.htm?homeOnFail) action provides much more functionality than the freq action. Before you can use it, load the freqTab action set using the [builtins.loadActionSet](https://go.documentation.sas.com/doc/en/pgmsascdc/v_017/caspg/cas-builtins-loadactionset.htm) action.

In [None]:
conn.loadActionSet('freqtab')

b. Use the freqTab action to create a simple one-way frequency table similar to the one from the freq action. Begin by creating a frequency table of the **Category** column by using the tabulate parameter. Notice that the freqTab action returns a variety of information, such as the level information, number of observations, the one-way frequency table, and timing.

In [None]:
ft_cr = tbl.freqTab(tabulate = 'Category')
ft_cr

c. View the keys of the **CASResults** object using the keys method. You see that four keys exist in the **CASResults** object.

In [None]:
ft_cr.keys()

d. To visualize the one-way frequency table, call the *Table1.OneWayFreqs* key to return the **SASDataFrame**. Then use Pandas to sort and plot the **SASDataFrame**.

In [None]:
## Store the SASDataFrame from the CASResults objects
freq_df = ft_cr['Table1.OneWayFreqs']

## Process the SASDataFrame using Pandas
(freq_df
 .sort_values('Percent', ascending = False)
 .plot(kind='bar', x = 'Category', y = 'Percent', figsize = (8,6)));

e. The freqTab action enables you to create as many frequency and crosstabulation tables as you would like within a single action by specifying a list in the tabulate parameter. Here, the freqTab action creates a one-way frequency table for **Category** and **LoanGrade**, and a crosstabulation of **Category** and **LoanGrade**. Store the results in the **ft** variable and display the results.

**Note**: To create crosstabulation tables, specify a dictionary as an item in the tabulate list. Use the key *vars*, followed by a list of the columns to use for the crosstabulation table.

In [None]:
ft = tbl.freqTab(tabulate = [
                   'Category',
                   'LoanGrade',
                   {'vars' : ['Category','LoanGrade']},
                ])

display(ft)

f. To view the first 15 rows of the crosstabulation, call the *Table3.CrossList* key from the **CASResults** object to store the **SASDataFrame**. Then execute the head method.

In [None]:
ft['Table3.CrossList'].head(15)
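
A list-style crosstabulation has one row per combination of levels. If a wide layout is easier to read, the Pandas pivot method can reshape it on the client. The sketch below assumes a table with **Category**, **LoanGrade**, and *Frequency* columns; `cross_list` is a made-up stand-in with invented values, and the real table might contain additional columns.

```python
import pandas as pd

## Made-up stand-in for a list-style crosstabulation
cross_list = pd.DataFrame({
    'Category':  ['Car', 'Car', 'Mortgage', 'Mortgage'],
    'LoanGrade': ['A', 'B', 'A', 'B'],
    'Frequency': [10, 5, 20, 15],
})

## One row per Category, one column per LoanGrade
wide = cross_list.pivot(index='Category', columns='LoanGrade', values='Frequency')

print(wide)
```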

### 6. Creating Crosstabs Using the crossTab Action

a. The [simple.crossTab](https://go.documentation.sas.com/doc/en/pgmsascdc/v_017/casanpg/cas-simple-crosstab.htm) action performs a one-way or two-way tabulation. Here, the crossTab action creates a two-way tabulation between **Category** and **LoanGrade**. Use the row parameter to specify the row, and the col parameter to specify the column. Store the **SASDataFrame** from the result of the crossTab action in the variable **cross_df** and view the results. Notice that the results of the crossTab action do not name the columns by default.

In [None]:
cross_df = tbl.crosstab(row = 'Category', col = 'LoanGrade')['Crosstab']
cross_df

b. A **SASDataFrame** contains a variety of additional attributes and methods. One useful attribute is the colinfo attribute. It enables you to view column information of the **SASDataFrame**, such as the column name, label, and data type. When you view the colinfo attribute, notice that the *label* key of each column contains the **LoanGrade** value. You can use this information to rename the default column names of the crossTab action.

In [None]:
## View the SASDataFrame column attributes. This is an additional attribute available with SASDataFrames
cross_df.colinfo

c. You can use the apply_labels method with inplace = *True* to apply the column labels as the column names. Execute the cell and notice that the column names have changed.

In [None]:
## Apply the labels of the SASDataFrame as the column names
cross_df.apply_labels(inplace = True)
cross_df
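
Conceptually, applying labels amounts to renaming each column to its label. The sketch below illustrates that idea with the plain Pandas rename method. The default names *Col1* and *Col2* and the `labels` dictionary are hypothetical stand-ins, and the code is not the SWAT implementation of apply_labels.

```python
import pandas as pd

## Made-up stand-in for a crosstab result with default column names
cross_df = pd.DataFrame({'Category': ['Car', 'Mortgage'],
                         'Col1': [10, 20],
                         'Col2': [5, 15]})

## Hypothetical column labels, keyed by the current column name
labels = {'Category': '', 'Col1': 'A', 'Col2': 'B'}

## Rename only the columns that have a non-empty label
renamed = cross_df.rename(
    columns={name: label for name, label in labels.items() if label})

print(renamed)
```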

d. In the crossTab action, you can add the aggregator and weight parameters to summarize the data. Here, the mean **InterestRate** is calculated for each **Category** by **LoanGrade**. Then rename the default columns using the apply_labels method.

In [None]:
## Summarize the data in CAS
cross_df = tbl.crosstab(row = 'Category', 
                        col = 'LoanGrade', 
                        aggregator = 'MEAN', 
                        weight = 'InterestRate')['Crosstab']


## Rename the columns with the labels
cross_df.apply_labels(inplace = True)
cross_df
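
To sanity-check a summarized result on a small sample, a comparable table can be built on the client with the Pandas crosstab function. The sketch below uses a made-up **DataFrame** with invented values; it is a client-side analog for illustration, not a replacement for summarizing the data in CAS.

```python
import pandas as pd

## Made-up sample of loan rows
loans = pd.DataFrame({
    'Category':     ['Car', 'Car', 'Car', 'Mortgage'],
    'LoanGrade':    ['A', 'A', 'B', 'A'],
    'InterestRate': [4.0, 6.0, 9.0, 3.0],
})

## Mean InterestRate for each Category by LoanGrade
mean_rate = pd.crosstab(index=loans['Category'],
                        columns=loans['LoanGrade'],
                        values=loans['InterestRate'],
                        aggfunc='mean')

print(mean_rate)
```

Combinations that do not occur in the sample, such as a grade B mortgage here, appear as missing values in the Pandas result.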

### 7. Terminate the CAS Session

It's best practice to always terminate the CAS session when you are done.

In [None]:
conn.terminate()