<hr />
<h3 style="color:cornflowerblue;">Pandas Recap 1</h3>
<p>A quick review of the main points from this week&rsquo;s asynch, to be sure we know the basics of pandas. Let's test a few things before importing Ruth's salary data file.</p>
<p>Creating and exploring the data and indices. Data can come from comma-, tab-, or other-delimited text files (such as .csv, .tsv, .dat, .txt), from semi-structured text formats (.json, .xml), from servers addressed by URIs (Uniform Resource Identifiers, of which URLs, Uniform Resource Locators, are the most common), and from relational (SQL) and non-relational (NoSQL) sources.</p>
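<p>As a rough sketch of how these sources map onto pandas readers, the cell below loads the same two records from comma-delimited, tab-delimited, and JSON text. The sample strings are made up for illustration and wrapped in <code>io.StringIO</code> so that no files are needed.</p>

```python
import io
import pandas as pd

# Hypothetical sample data, inlined so the example needs no files on disk.
csv_text = "name,salary\nFifi,60000\nGerry,22200\n"
tsv_text = "name\tsalary\nFifi\t60000\nGerry\t22200\n"
json_text = '[{"name": "Fifi", "salary": 60000}, {"name": "Gerry", "salary": 22200}]'

from_csv = pd.read_csv(io.StringIO(csv_text))             # comma-delimited
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")   # tab-delimited
from_json = pd.read_json(io.StringIO(json_text))          # a JSON list of records

# Each reader returns a DataFrame with the same two rows and two columns.
for frame in (from_csv, from_tsv, from_json):
    print(frame, "\n")
```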

<hr />
<div style="background-color:#AADBFF;padding:10px;border-radius:3px;font-size:24px;font-family:Avenir,Arial;">
    <b>Pandas Components</b>
</div>
<p>
The main pandas components are <code>Series</code> (a single labeled column) and <code>DataFrame</code>, a two-dimensional table made up of Series.
<img style="width:400px;" src="images/series-and-dataframe.width-1200.png" />
Use pandas to create a frame from a dictionary of lists (the literal looks much like JSON):
<br />
<pre>
import pandas as pd

data = {
   'cats': [1, 4, 2, 8],
   'dogs': [1, 3, 3, 2]
}
pets = pd.DataFrame(data)
</pre>
</p>
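<p>To see the relationship between the two components, here is a small sketch using the same <code>pets</code> frame: pulling out one column gives back a <code>Series</code> that shares the frame's row index.</p>

```python
import pandas as pd

data = {
    "cats": [1, 4, 2, 8],
    "dogs": [1, 3, 3, 2],
}
pets = pd.DataFrame(data)

cats = pets["cats"]                    # selecting one column returns a Series
print(type(cats).__name__)             # Series
print(pets.shape)                      # (4, 2): four rows, two columns
print(cats.index.equals(pets.index))   # True: the Series keeps the frame's row labels
```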

<hr />
<div style="background-color:#AADBFF;padding:10px;border-radius:3px;font-size:24px;font-family:Avenir,Arial;">
    <b>Recap: Exploration &amp; Analysis: Transformations for analysis</b>
</div>

<table style="font-size:20px;font-family:Avenir,Arial; line-height:24px;">
<tr>
    <td style="text-align:left" width="12%">Type</td>
    <td style="text-align:left" width="30%">Description</td>
    <td style="text-align:left">Example</td>
</tr>
<tr>
<td style="text-align:left">Series</td>
<td style="text-align:left">1D labeled homogeneous array, size is immutable.</td>
<td style="font-family:Courier; text-align:left">
import pandas as pd<br />
import numpy as np<br />
data = np.array(['a','b','c'])<br />
s = pd.Series(data)<br />
print(s)
</td>
</tr>
<tr><td style="text-align:left">DataFrame</td>
<td style="text-align:left">General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns</td>
<td style="font-family:Courier; text-align:left">
import pandas as pd<br />
data = [1, 2, 3, 4, 5]<br />
df = pd.DataFrame(data)<br />
    </td>
    </tr>
    <tr><td style="text-align:left">Panel</td>
        <td style="text-align:left">General 3D labeled, size-mutable array. Deprecated in pandas 0.20 and removed in 0.25; use a DataFrame with a MultiIndex (or the xarray package) instead.</td>
<td style="font-family:Courier; text-align:left">
    import pandas as pd<br />
    import numpy as np<br />
    data = np.random.rand(2, 4, 5)<br />
    # pd.Panel(data) fails on pandas 0.25 and later<br />
        </td>
    </tr>
    </table>
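<p>Since <code>Panel</code> is no longer available in pandas 0.25 and later, one common substitute is a <code>DataFrame</code> with a <code>MultiIndex</code>. The sketch below is only one way to do it; the level names and item labels are invented for illustration.</p>

```python
import numpy as np
import pandas as pd

# Same shape as a 2 x 4 x 5 Panel: 2 items, 4 rows per item, 5 columns.
data = np.arange(40).reshape(2, 4, 5)

# Hypothetical labels ("item0", "item1") and level names ("item", "row").
index = pd.MultiIndex.from_product(
    [["item0", "item1"], range(4)], names=["item", "row"]
)
stacked = pd.DataFrame(data.reshape(8, 5), index=index, columns=list("abcde"))

print(stacked.shape)                # (8, 5)
print(stacked.loc["item1"].shape)   # (4, 5): one 2D slice of the 3D data
```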
<div style="height:100px;"><hr /></div>

<hr />
<h3>1. Working through a small example.</h3>
<p>Let's start off with some JSON-like data, held in a Python dictionary, just to remember some basic pandas commands.</p>

In [12]:
# By Hand
# as we know from this week's asynch 
import pandas as pd
import numpy as np

pay = {
  "Fifi":{
    "title":"manager",
    "salary":60000,
    "cost_of_living":1.0,
    "new_salary":60000,
  },
  "Juan":{
    "title":"manager",
    "salary":60000,
    "cost_of_living":1.0,
    "new_salary":60000,
  },
  "Toby":{
    "title":"programmer",
    "salary":60000,
    "cost_of_living":1.0,
    "new_salary":60000,
  },
  "Gerry":{
    "title":"analyst",
    "salary":22200,
    "cost_of_living":1.0,
    "new_salary":22222,
  },
  "Padma":{
    "title":"analyst",
    "salary":25200,
    "cost_of_living":1.0,
    "new_salary":22222,
  }
}

print("\n")

df = pd.DataFrame(pay)
print(df)
print("\n")
# the row labels are available through the index attribute
print(df.index)

# let's check THE COLUMN HEADINGS before continuing.
for col in df.columns:
    print(col)



                   Fifi     Juan        Toby    Gerry    Padma
title           manager  manager  programmer  analyst  analyst
salary            60000    60000       60000    22200    25200
cost_of_living        1        1           1        1        1
new_salary        60000    60000       60000    22222    22222


Index(['title', 'salary', 'cost_of_living', 'new_salary'], dtype='object')
Fifi
Juan
Toby
Gerry
Padma


<hr />
<h3>But the data could be on the wrong axis for our needs ... </h3>

In [13]:
# check out the file ... is this right?
print(df.head())

print("\n")
# no, so let's transpose it.
pay = pd.DataFrame(pay).T
print(pay.head())  # shows the first 5 rows by default; pass n to change that

                   Fifi     Juan        Toby    Gerry    Padma
title           manager  manager  programmer  analyst  analyst
salary            60000    60000       60000    22200    25200
cost_of_living        1        1           1        1        1
new_salary        60000    60000       60000    22222    22222


            title salary cost_of_living new_salary
Fifi      manager  60000              1      60000
Juan      manager  60000              1      60000
Toby   programmer  60000              1      60000
Gerry     analyst  22200              1      22222
Padma     analyst  25200              1      22222
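<p>Transposing works, but note in the output above that every column is now printed as generic objects. As a possible alternative, <code>DataFrame.from_dict</code> with <code>orient="index"</code> treats the outer keys as row labels directly. The two-person dictionary below is a shortened, hypothetical version of <code>pay</code>.</p>

```python
import pandas as pd

# Shortened, hypothetical version of the pay dictionary.
pay_dict = {
    "Fifi": {"title": "manager", "salary": 60000},
    "Gerry": {"title": "analyst", "salary": 22200},
}

# orient="index" treats the outer keys as row labels, so no .T is needed.
by_person = pd.DataFrame.from_dict(pay_dict, orient="index")
transposed = pd.DataFrame(pay_dict).T

print(by_person.dtypes)    # salary is an integer column
print(transposed.dtypes)   # salary is an object column after transposing mixed types
```

<p>Keeping numeric columns numeric matters later, when we call methods such as <code>mean()</code>.</p>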


<hr />
<h3 style="color:cornflowerblue;">Locating data in a DF using <code>loc</code> and <code>iloc</code></h3>

In [14]:
# Two ways to pick a row: by label (loc) or by integer position (iloc).
# Positions start at 0, so iloc[1] is the second row (Juan); iloc[0] would match loc['Fifi'].

print("Fifi's Data: ", pay.loc['Fifi'])

print("\n","-"*10,"\n")

print("iloc 1 = ", pay.iloc[1])

Fifi's Data:  title             manager
salary              60000
cost_of_living          1
new_salary          60000
Name: Fifi, dtype: object

 ---------- 

iloc 1 =  title             manager
salary              60000
cost_of_living          1
new_salary          60000
Name: Juan, dtype: object
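<p>A small sketch of how the two indexers line up, using a cut-down copy of the <code>pay</code> frame: position 0 corresponds to the first label, and slices behave differently at the end point.</p>

```python
import pandas as pd

pay = pd.DataFrame(
    {"title": ["manager", "manager", "programmer"], "salary": [60000, 60000, 60000]},
    index=["Fifi", "Juan", "Toby"],
)

# Position 0 is the first row, so these two lookups return the same row.
print(pay.loc["Fifi"].equals(pay.iloc[0]))   # True

# Label slices include the end label; positional slices exclude the end position.
print(list(pay.loc["Fifi":"Juan"].index))    # ['Fifi', 'Juan']
print(list(pay.iloc[0:1].index))             # ['Fifi']
```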


In [15]:
# here we move from row/column lookups with loc to column methods such as .mean()

row = 'Juan'
col = 'salary'
print("Let's check out some salaries:  Juan's salary is ", pay.loc[row, col], "\n")

print("And the company mean is: ", pay['salary'].mean(), "\n")

# let's flag everyone below a salary point (this prints a True/False mask, not the rows)
print(pay.salary < 60000, "\n")

print("\n","-"*40,"\nWho's underpaid? \n\t")
print(pay.salary < pay['salary'].mean())

Let's check out some salaries:  Juan's salary is  60000 

And the company mean is:  45480.0 

Fifi     False
Juan     False
Toby     False
Gerry     True
Padma     True
Name: salary, dtype: bool 


 ---------------------------------------- 
Who's underpaid? 
	
Fifi     False
Juan     False
Toby     False
Gerry     True
Padma     True
Name: salary, dtype: bool
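<p>The comparisons above only produce True/False masks. To get the matching rows themselves, the mask can be passed back to the frame, as in this sketch with the salary column alone.</p>

```python
import pandas as pd

pay = pd.DataFrame(
    {"salary": [60000, 60000, 60000, 22200, 25200]},
    index=["Fifi", "Juan", "Toby", "Gerry", "Padma"],
)

below_mean = pay["salary"] < pay["salary"].mean()   # boolean Series, one value per row
underpaid = pay[below_mean]                         # keep only the rows marked True

print(list(underpaid.index))   # ['Gerry', 'Padma']
print(int(below_mean.sum()))   # 2, because True counts as 1
```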


<blockquote>
    <p>Reminder: many statistics-oriented methods are built in: <code>sum()</code>, <code>min()</code>,
        <code>max()</code>, <code>std()</code>,
        <code>corr()</code>, <code>cov()</code>, the all-in-one <code>describe()</code>, and
        <code>value_counts()</code>.</p>
</blockquote>
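<p>For instance, a quick sketch of a few of these methods on the example salaries (rebuilt here so the cell runs on its own):</p>

```python
import pandas as pd

pay = pd.DataFrame(
    {
        "title": ["manager", "manager", "programmer", "analyst", "analyst"],
        "salary": [60000, 60000, 60000, 22200, 25200],
    },
    index=["Fifi", "Juan", "Toby", "Gerry", "Padma"],
)

print(pay["salary"].describe())       # count, mean, std, min, quartiles, max
print(pay["title"].value_counts())    # how many people hold each title
print(pay["salary"].max() - pay["salary"].min())   # 37800, the salary range
```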

In [16]:
# RECAP: filtering rows by passing a boolean condition to loc
pay.loc[pay['salary'] == 60000]

Unnamed: 0,title,salary,cost_of_living,new_salary
Fifi,manager,60000,1,60000
Juan,manager,60000,1,60000
Toby,programmer,60000,1,60000


In [17]:
pay.loc[(pay['salary'] >= 40000) & (pay['salary'] <= 75000)]

Unnamed: 0,title,salary,cost_of_living,new_salary
Fifi,manager,60000,1,60000
Juan,manager,60000,1,60000
Toby,programmer,60000,1,60000


In [18]:
# Empty cells: two ways to fill them, fillna() or mask().
# mask() replaces every value that meets some condition,
# e.g. values above a threshold or, as here, missing values.

df = pd.DataFrame({"A":[.2, .4, .1, None, .2],
                   "B":[.7, .2, .4, .3, None],
                   "C":[.14, .3, None, .2, .7],
                   "D":[.2, .3, .4, .2, .6],
                   "E":[.3, .3, .3, .2, .5],
                   "F":[.25, .3, .02, .2, .45],
                   "G":[.25, .3, .01, .2, .5]
                  })
  
# replace the None values with .1
df.mask(df.isna(), .1)

Unnamed: 0,A,B,C,D,E,F,G
0,0.2,0.7,0.14,0.2,0.3,0.25,0.25
1,0.4,0.2,0.3,0.3,0.3,0.3,0.3
2,0.1,0.4,0.1,0.4,0.3,0.02,0.01
3,0.1,0.3,0.2,0.2,0.2,0.2,0.2
4,0.2,0.1,0.7,0.6,0.5,0.45,0.5


In [19]:
df = pd.DataFrame({"A":[.2, .4, .1, None, .2],
                   "B":[.7, .2, .4, .3, None],
                   "C":[.14, .3, None, .2, .7],
                   "D":[.2, .3, .4, .2, .6],
                   "E":[.3, .3, .3, .2, .5],
                   "F":[.25, .3, .02, .2, .45],
                   "G":[.25, .3, .01, .2, .5]
                  })
df.fillna(0.1)

Unnamed: 0,A,B,C,D,E,F,G
0,0.2,0.7,0.14,0.2,0.3,0.25,0.25
1,0.4,0.2,0.3,0.3,0.3,0.3,0.3
2,0.1,0.4,0.1,0.4,0.3,0.02,0.01
3,0.1,0.3,0.2,0.2,0.2,0.2,0.2
4,0.2,0.1,0.7,0.6,0.5,0.45,0.5
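<p>Both cells above give the same result, so where do the two methods differ? A rough sketch on a smaller, made-up frame: <code>fillna()</code> can take a different fill value per column, while <code>mask()</code> works with any condition, not only missing values.</p>

```python
import numpy as np
import pandas as pd

# Small made-up frame with one missing value per column.
df = pd.DataFrame({"A": [0.2, np.nan, 0.1], "B": [0.7, 0.2, np.nan]})

# fillna accepts a dict, so each column can get its own replacement value.
filled = df.fillna({"A": 0.0, "B": df["B"].mean()})

# mask replaces wherever the condition is True; here it caps values at 0.5.
capped = filled.mask(filled > 0.5, 0.5)

print(filled)
print(capped)
```

<p>Neither call changes <code>df</code> itself; each returns a new frame.</p>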


<hr />
<h2 style="color: cornflowerblue;">Part 2: Transitioning to use-case scenarios: working coding into our work practices.</h2>
<p>We can never be certain about the source and quality of our data, nor how we'll perform our tasks, until we get the files and prepare the data.  Say you work for a company and periodically you&rsquo;re asked to provide reports on salaries and to calculate cost_of_living and salary bonuses.  For fun, the data come to you from the payroll manager, who is on her way out the door for vacation. Consequently you have to check the data carefully before creating your reports.  BTW, she usually sends data in a tab-delimited file (.tsv), but today she sent a .csv.  Let&rsquo;s see what we can do ... </p>
<p>The asynch and many websites show how easy it can be to read data in from a .csv file.  But on the job things may not go that easily.  Here we have a comma-delimited file exported from Apple Numbers.  Commands can act very differently based on our computing environment.  Here the .csv file needs some cleanup.
</p>
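<p>To get a feel for this kind of cleanup without the real file, here is a sketch that reads a small made-up export from a string. The text is only an assumption about what such a file might look like: the first header cell is empty and every line ends with a trailing comma.</p>

```python
import io
import pandas as pd

# Hypothetical stand-in for an exported file: the first header cell is empty
# and every line ends with a trailing comma.
raw = (
    ",title,salary_base,\n"
    "Fifi,manager,50000,\n"
    "Rex,manager,48000,\n"
)

# Unlabeled header cells are named "Unnamed: <position>".
plain = pd.read_csv(io.StringIO(raw))
print(list(plain.columns))    # ['Unnamed: 0', 'title', 'salary_base', 'Unnamed: 3']

# index_col=0 uses the first column (the names) as the row labels.
indexed = pd.read_csv(io.StringIO(raw), index_col=0)
print(list(indexed.index))    # ['Fifi', 'Rex']
```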

In [20]:
# IMHO - make sure people DON'T include the title of the exported data
# otherwise you have to do a lot of fiddling around w/ pandas to reset the index 
import pandas as pd

# in-class - change the .csv to .tsv and reload. Use .csv for rest of notebook.
# notice that the names sit in an unlabeled first column, so pandas calls it
# "Unnamed: 0" and adds its own row numbers (0, 1, 2, ...) as the index.
# The next cell shows the pretty version of the "pay" data

# NOTE: hard-coding the address to a file isn't a great idea
# because you can't port your code.
pay_test = pd.read_csv('BearCo.csv')
print(type(pay_test))

print(pay_test)


print("\n","-"*50,"\nSuppressing the row numbers during import using index_col=0:\n")


pay = pd.read_csv('BearCo.csv', index_col = 0)
print(pay)

<class 'pandas.core.frame.DataFrame'>
  Unnamed: 0       title  salary_base  cost_of_living  review_percent  \
0       Fifi     manager        50000            0.01            0.01   
1        Rex     manager        48000            0.01            0.01   
2       Toby  programmer        40000            0.01            0.01   
3       Lars     analyst        42500            0.01            0.01   
4      Gerry    designer        25000            0.01            0.01   

   revised_salary  Unnamed: 6  
0             NaN         NaN  
1             NaN         NaN  
2             NaN         NaN  
3             NaN         NaN  
4             NaN         NaN  

 -------------------------------------------------- 
Suppressing the row numbers during import using index_col=0:

            title  salary_base  cost_of_living  review_percent  \
Fifi      manager        50000            0.01            0.01   
Rex       manager        48000            0.01            0.01   
Toby   programmer

In [11]:
pay

Unnamed: 0,title,salary_base,cost_of_living,review_percent,revised_salary,Unnamed: 6
Fifi,manager,50000,0.01,0.01,,
Rex,manager,48000,0.01,0.01,,
Toby,programmer,40000,0.01,0.01,,
Lars,analyst,42500,0.01,0.01,,
Gerry,designer,25000,0.01,0.01,,


<hr />
<h3>Addressing issues when reading-in data ... </h3>
<p>The column on the left is fine; but the column labels and the extra empty column (Unnamed: 6) need to be addressed.</p>
<p>First let's remove that empty column; then try to use the data from the first row as our new column labels.</p>

In [21]:
# check out column names
print(pay.columns)

print(pay.columns[5])

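# drop() works on rows by default, so pass axis=1 to drop a column;
# without it, pandas looks for a row labeled 'Unnamed: 6' and raises a KeyError.
pay.drop(pay.columns[5], axis=1, inplace=True)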
pay

Index(['title', 'salary_base', 'cost_of_living', 'review_percent',
       'revised_salary', 'Unnamed: 6'],
      dtype='object')
Unnamed: 6


Unnamed: 0,title,salary_base,cost_of_living,review_percent,revised_salary
Fifi,manager,50000,0.01,0.01,
Rex,manager,48000,0.01,0.01,
Toby,programmer,40000,0.01,0.01,
Lars,analyst,42500,0.01,0.01,
Gerry,designer,25000,0.01,0.01,
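<p>Dropping by position works here, but it depends on the column order of the export. The sketch below, on a cut-down and partly made-up frame, shows two name-based alternatives, plus a caution about dropping every all-empty column.</p>

```python
import numpy as np
import pandas as pd

pay = pd.DataFrame(
    {
        "title": ["manager", "analyst"],
        "revised_salary": [np.nan, np.nan],   # empty for now, but we want to keep it
        "Unnamed: 6": [np.nan, np.nan],       # debris from the export
    },
    index=["Fifi", "Lars"],
)

# Dropping by name is explicit and does not depend on column position.
by_name = pay.drop(columns=["Unnamed: 6"])

# Or keep every column whose label does not start with "Unnamed".
by_pattern = pay.loc[:, ~pay.columns.str.startswith("Unnamed")]

# Caution: dropping all-NaN columns would also remove revised_salary here.
all_nan_dropped = pay.dropna(axis=1, how="all")

print(list(by_name.columns))           # ['title', 'revised_salary']
print(list(by_pattern.columns))        # ['title', 'revised_salary']
print(list(all_nan_dropped.columns))   # ['title']
```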


In [22]:
# iloc test
print(pay.iloc[1,1])  # row 1 (Rex), column 1 (salary_base): 48000

# loc test
print("\nRex's salary is ", pay.loc['Rex', 'salary_base'])

48000

Rex's salary is  48000


<hr />
<h3 style="color:cornflowerblue;">Creating our own methods and using <code>assign()</code></h3>
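<p><code>assign()</code> returns a copy of the frame with a new or replaced column, which makes it a handy method for changing the data in an entire column.</p>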

In [24]:
# change cost_of_living from 0.01 to a new value
# let's try the assign 
pay = pay.assign(cost_of_living = 0.52)
pay

Unnamed: 0,title,salary_base,cost_of_living,review_percent,revised_salary
Fifi,manager,50000,0.52,0.01,
Rex,manager,48000,0.52,0.01,
Toby,programmer,40000,0.52,0.01,
Lars,analyst,42500,0.52,0.01,
Gerry,designer,25000,0.52,0.01,
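<p><code>assign()</code> also accepts a function in place of a fixed value. In this sketch, <code>monthly_base</code> is a hypothetical column added only to show the idea; it is not part of the BearCo data.</p>

```python
import pandas as pd

pay = pd.DataFrame({"salary_base": [50000, 48000]}, index=["Fifi", "Rex"])

# A scalar is repeated for every row; a callable receives the frame and returns a column.
updated = pay.assign(
    cost_of_living=0.52,
    monthly_base=lambda d: d["salary_base"] / 12,   # hypothetical derived column
)

print(list(pay.columns))       # ['salary_base'], the original frame is unchanged
print(list(updated.columns))   # ['salary_base', 'cost_of_living', 'monthly_base']
```

<p>Because the original frame is left alone, the result has to be assigned back to a name, as the cell above does with <code>pay = pay.assign(...)</code>.</p>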


<hr /><p>After the reviews, the Boss allows an increase in salary. Lars and Gerry are getting a boost.</p>

In [25]:
pay.loc["Lars","review_percent"] = 3
pay.loc["Gerry","review_percent"] = 3
pay

Unnamed: 0,title,salary_base,cost_of_living,review_percent,revised_salary
Fifi,manager,50000,0.52,0.01,
Rex,manager,48000,0.52,0.01,
Toby,programmer,40000,0.52,0.01,
Lars,analyst,42500,0.52,3.0,
Gerry,designer,25000,0.52,3.0,


<hr /><p>Update the revised_salary column with new values... </p>

In [26]:
# overcoded on purpose, to show how to apply a function of our own to each row
def update_pay(row):
    base = row['salary_base']
    return base + base * (row['cost_of_living']*.10) + base * (row['review_percent']*.10)

# axis=1 passes each row to update_pay; no lambda wrapper is needed
pay['revised_salary'] = pay.apply(update_pay, axis=1)
pay

Unnamed: 0,title,salary_base,cost_of_living,review_percent,revised_salary
Fifi,manager,50000,0.52,0.01,52650.0
Rex,manager,48000,0.52,0.01,50544.0
Toby,programmer,40000,0.52,0.01,42120.0
Lars,analyst,42500,0.52,3.0,57460.0
Gerry,designer,25000,0.52,3.0,33800.0
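<p>For a formula this simple, the row-by-row function is not strictly needed. As a sketch of the alternative, the same arithmetic can be written on whole columns at once; two rows from the table above are rebuilt here so the cell runs on its own.</p>

```python
import pandas as pd

pay = pd.DataFrame(
    {
        "salary_base": [50000, 42500],
        "cost_of_living": [0.52, 0.52],
        "review_percent": [0.01, 3.0],
    },
    index=["Fifi", "Lars"],
)

def update_pay(row):
    base = row["salary_base"]
    return base + base * (row["cost_of_living"] * 0.10) + base * (row["review_percent"] * 0.10)

# Row by row, as in the cell above.
row_wise = pay.apply(update_pay, axis=1)

# The same formula applied to whole columns at once.
vectorized = pay["salary_base"] * (
    1 + 0.10 * pay["cost_of_living"] + 0.10 * pay["review_percent"]
)

print(vectorized.round(2))
print((row_wise - vectorized).abs().max() < 1e-6)   # both approaches agree
```

<p>On five rows the difference is invisible, but column arithmetic tends to scale better on large frames.</p>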


In [27]:
# revised_salary is filled in now; carry it over as the new salary_base

pay.salary_base = pay.revised_salary
pay

Unnamed: 0,title,salary_base,cost_of_living,review_percent,revised_salary
Fifi,manager,52650.0,0.52,0.01,52650.0
Rex,manager,50544.0,0.52,0.01,50544.0
Toby,programmer,42120.0,0.52,0.01,42120.0
Lars,analyst,57460.0,0.52,3.0,57460.0
Gerry,designer,33800.0,0.52,3.0,33800.0


<hr />
<p>This example outlines some of the basic features of pandas.</p>

<p>We'll apply these and a lot more commands about grouping and aggregating data from a much larger dataset in the next few lessons.</p>