
# Control Structures, DataFrames and Descriptive Statistics

## PETE 2061 Lab 4

<a id='top'></a>

<a id='overview'></a>
# Topics Covered
<font color=blue>
 * [Control structures and functions](#control)
 * [Exiting loops](#exit_loops)
 * [Nested loops](#nested_loops)
 * [List Comprehension](#list_comprehension)
 * [Pandas Series](#series)
 * [Pandas DataFrames](#dataframes)
 * [Grouping](#grouping)    
 * [Interactive input](#input)</font>
<br>

<a id='control'></a>
## Control Structures 
<font color=blue>
* if-else
* for loops, while loops
* break: jump out of the current loop
* continue: jump to the top of next cycle within the loop
* pass: do nothing 
</font> 
[top](#top)

An "if statement" is written using the "if" keyword, followed by a condition and a colon.

In [1]:
a = 10
b = 12

if (b > a):
    print("b is greater than a")
    x = 8
    y = 3
else:
    print("b is not greater than a")
    a = 2
    a *= 2

print("I'm here")

b is greater than a
I'm here


* As in the "for" and "while" statements, indentation is important.
* Use the "tab" key for indentation
* As in the "for" and "while" statements, do not forget the colon (:) after the if statement!

In [5]:
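if (b > a):   # a = 10 and b = 12 were set in the first cell, so this condition is True
    print("b is greater than a")   # The indented lines form the body of the if statement; removing the indentation would raise an IndentationError
    print("1")
    x = 2
    y = 3
print("2")   # Not indented, so it runs whether or not the condition is True

b is greater than a
1
2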


* The elif keyword is Python's way of saying "if the previous conditions were not true, then try this condition".
* It basically means "else if"

In [None]:
a = 10
b = 9

if (b > a):
    print("b is greater than a")
elif (a == b):
    print("a and b are equal")

The "else" keyword catches anything which isn't caught by the preceding conditions.

In [7]:
a = 20
b = 10
if b > a:
    print("b is greater than a")
elif a == b:
    print("a and b are equal")
else:
    print("a is greater than b")

a is greater than b


You can also have an else without the elif:

In [None]:
a = 20
b = 10
if b > a:
    print("b is greater than a")
else:
    print("b is not greater than a")

If you have only one statement to execute, you can put it on the same line as the if statement.

In [None]:
if a > b: print("a is greater than b")

If you have only one statement to execute, one for if, and one for else, you can put it all on the same line:

In [None]:
a = 10
b = 20
print("A") if (a > b) else print("B")

You can also have multiple else statements on the same line:

In [None]:
a = 10
b = 10
print("A") if a > b else print("=") if a == b else print("B")
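The one-line forms above are conditional expressions, so they can also produce a value instead of calling print directly. A minimal sketch, assuming a helper named `compare` (an illustrative name, not part of the lab):

```python
def compare(a, b):
    """Return a short label describing how a relates to b."""
    # Chained conditional expressions are evaluated from left to right
    return "A" if a > b else "=" if a == b else "B"

label = compare(10, 10)   # same values as the cell above
print(label)
```

Storing the result in a variable is often easier to read than nesting print calls inside the expression.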

* The "and" keyword is a logical operator, and can be used to combine conditional statements
* The "or" keyword is a logical operator, and can be used to combine conditional statements

In [None]:
a = 20
b = 30
c = 50
if (a > b) and (c > a):
    print("Both conditions are True")

if (a > b) or (a > c):   #note that I do not indent this if statement, so that it is not nested under the previous if statement
    print("At least one of the conditions is True")
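With the values used above, neither message is printed, because `a > b` and `a > c` are both False. The sketch below uses different, arbitrarily chosen values so that the effect of each operator is visible; the names `both`, `either` and `neither` are illustrative.

```python
a, b, c = 40, 30, 50

both = (a > b) and (c > a)            # True and True  -> True
either = (a > b) or (a > c)           # True or False  -> True
neither = not ((a > b) or (a > c))    # not True       -> False

if both:
    print("Both conditions are True")
if either:
    print("At least one of the conditions is True")
print(both, either, neither)
```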

You can have if statements inside if statements; these are called nested if statements.

In [None]:
x = 10
if x > 10:
    print("Above ten,")
    if x > 20:
        print("and also above 20!")
    else:
        print("but not above 20.")


<a id='exit_loops'></a>
## Using break and continue statements in loops
* With the break statement we can stop the loop before it has looped through all the items
* With the continue statement we can stop the current iteration of the loop, and continue with the next

In [None]:
fruits = ["apple","kiwi", "banana", "cherry", "orange"]
for x in fruits:
    print(x) 
    if x == "banana":
        break

#Can you modify this code snippet so that it breaks out of the loop before printing banana?    

In [None]:
fruits = ["apple","kiwi", "banana", "cherry", "orange"]
for x in fruits:
    if x == "banana":
        continue     
    print(x)
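The list of control structures at the top of this lab also mentions while loops and the pass statement, which the cells above do not exercise. A small sketch under illustrative names (`count`, `visited`):

```python
count = 3
visited = []
while count > 0:            # repeats as long as the condition is True
    if count == 2:
        pass                # placeholder: do nothing for this value
    else:
        visited.append(count)
    count -= 1              # without this update the loop would never end

print(visited)
```

Unlike continue, pass does not skip the rest of the iteration; it simply fills a block that would otherwise be empty.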

<a id='nested_loops'></a>
## Nested Loops
* A nested loop is a loop inside a loop.
* The "inner loop" will be executed one time for each iteration of the "outer loop":

In [None]:
adj = ["red", "big", "tasty"]
fruits = ["apple", "banana", "cherry"]

for x in adj:
    for y in fruits:
        print(x, y)
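A break inside a nested loop only leaves the loop that contains it. The sketch below reuses the two lists from the cell above and collects the pairs in a list (`pairs` is an illustrative name) to show that the outer loop keeps running:

```python
adj = ["red", "big", "tasty"]
fruits = ["apple", "banana", "cherry"]

pairs = []
for x in adj:
    for y in fruits:
        if y == "banana":
            break            # leaves the inner loop only
        pairs.append((x, y))

print(pairs)
```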

<a id='dataframes'></a>
## Pandas DataFrames 
* Works like a spreadsheet (e.g., Excel)
* It has several built-in functions for descriptive statistics 
* It can be used for data input and output (io)
* It can be created from a dictionary
* We can think of a DataFrame as a bunch of Series objects put together to share the same index. 
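The last two points can be sketched together: a dictionary whose values are Series with a common index becomes a DataFrame. The well names and numbers below are made up purely for illustration.

```python
import pandas as pd

# Two Series that share the same index labels (values are invented)
depth = pd.Series([3500, 2800, 3000], index=["w1", "w2", "w3"])
rate = pd.Series([120.0, 95.5, 210.0], index=["w1", "w2", "w3"])

# Dictionary keys become column names; the shared index labels the rows
wells = pd.DataFrame({"depth": depth, "rate": rate})
print(wells)
print(type(wells["depth"]))   # each column is a Series
```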

In [57]:
import pandas as pd
import numpy as np

In [None]:
from numpy.random import randn   # this is how to import a specific function from a package
np.random.seed(101)              # Using a seed ensures the same sequence of random numbers is obtained every time the code is run
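To see what the seed does, the sketch below draws three numbers, resets the seed, and draws again. The names `first` and `second` are illustrative.

```python
import numpy as np

np.random.seed(101)
first = np.random.randn(3)

np.random.seed(101)          # resetting the seed restarts the sequence
second = np.random.randn(3)

print(np.array_equal(first, second))
```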

In [58]:
myIndex = 'A B C D E'.split()  #This creates a list ['A','B','C','D','E']
print(myIndex)
print(type(myIndex))

['A', 'B', 'C', 'D', 'E']
<class 'list'>


In [62]:
myRand = randn(5,4)                # creates a 5x4 matrix of random numbers from the standard normal distribution
myRand2 = np.random.random((5,4))  # creates a 5x4 matrix of random numbers from the continuous uniform distribution 
print(myRand)
print(myRand2)
df = pd.DataFrame(myRand, index=myIndex, columns='W X Y Z'.split())
df
#1st argument is data, 2nd argument is index for the rows, third is column headings.

[[-0.04567648  0.01242099  0.09362798  1.24081261]
 [-1.09769302 -1.90800882 -0.3801035  -1.66605918]
 [-2.7369946   1.52256211  0.17800909 -0.62680541]
 [-0.39108897  1.74347695  1.13001805  0.89779631]
 [ 0.33086562 -1.06304889 -0.1253808  -0.94558812]]
[[0.43777358 0.62069599 0.36917206 0.39532618]
 [0.30857136 0.59208225 0.18479466 0.96744482]
 [0.87950369 0.11951996 0.69008754 0.90400383]
 [0.60412904 0.49140776 0.79315227 0.17702482]
 [0.34202957 0.25553147 0.2326785  0.76111692]]


Unnamed: 0,W,X,Y,Z
A,-0.045676,0.012421,0.093628,1.240813
B,-1.097693,-1.908009,-0.380104,-1.666059
C,-2.736995,1.522562,0.178009,-0.626805
D,-0.391089,1.743477,1.130018,0.897796
E,0.330866,-1.063049,-0.125381,-0.945588


In [63]:
help(randn)   # the help function gives a description of the function specified as an argument

Help on built-in function randn:

randn(...) method of mtrand.RandomState instance
    randn(d0, d1, ..., dn)
    
    Return a sample (or samples) from the "standard normal" distribution.
    
    If positive, int_like or int-convertible arguments are provided,
    `randn` generates an array of shape ``(d0, d1, ..., dn)``, filled
    with random floats sampled from a univariate "normal" (Gaussian)
    distribution of mean 0 and variance 1 (if any of the :math:`d_i` are
    floats, they are first converted to integers by truncation). A single
    float randomly sampled from the distribution is returned if no
    argument is provided.
    
    This is a convenience function.  If you want an interface that takes a
    tuple as the first argument, use `numpy.random.standard_normal` instead.
    
    Parameters
    ----------
    d0, d1, ..., dn : int, optional
        The dimensions of the returned array, should be all positive.
        If no argument is given a single Python float is ret

In [64]:
help(np.random.random)

Help on built-in function random_sample:

random_sample(...) method of mtrand.RandomState instance
    random_sample(size=None)
    
    Return random floats in the half-open interval [0.0, 1.0).
    
    Results are from the "continuous uniform" distribution over the
    stated interval.  To sample :math:`Unif[a, b), b > a` multiply
    the output of `random_sample` by `(b-a)` and add `a`::
    
      (b - a) * random_sample() + a
    
    Parameters
    ----------
    size : int or tuple of ints, optional
        Output shape.  If the given shape is, e.g., ``(m, n, k)``, then
        ``m * n * k`` samples are drawn.  Default is None, in which case a
        single value is returned.
    
    Returns
    -------
    out : float or ndarray of floats
        Array of random floats of shape `size` (unless ``size=None``, in which
        case a single float is returned).
    
    Examples
    --------
    >>> np.random.random_sample()
    0.47108547995356098
    >>> type(np.random.random_

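The help text above gives the formula `(b - a) * random_sample() + a` for sampling a uniform distribution on another interval. A minimal sketch of that formula, with the bounds `a = 2.0` and `b = 5.0` chosen arbitrarily:

```python
import numpy as np

np.random.seed(101)
a, b = 2.0, 5.0                      # illustrative interval bounds
u = np.random.random((5, 4))         # samples from [0.0, 1.0)
scaled = (b - a) * u + a             # rescaled to the interval from a to b

print(scaled.min() >= a, scaled.max() <= b)
```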
## Selection and Indexing
This refers to the various methods to grab data from a DataFrame

In [65]:
df['W']

A   -0.045676
B   -1.097693
C   -2.736995
D   -0.391089
E    0.330866
Name: W, dtype: float64

In [66]:
# Pass a list of column names
df[['W','Z']]   # Note the square brackets defining the list ['W','Z']

Unnamed: 0,W,Z
A,-0.045676,1.240813
B,-1.097693,-1.666059
C,-2.736995,-0.626805
D,-0.391089,0.897796
E,0.330866,-0.945588


In [68]:
# DataFrame Columns are just Series
type(df['W'])

pandas.core.series.Series

**Creating a new column:**

In [69]:
df['new'] = df['W'] + df['Y']
df

Unnamed: 0,W,X,Y,Z,new
A,-0.045676,0.012421,0.093628,1.240813,0.047952
B,-1.097693,-1.908009,-0.380104,-1.666059,-1.477797
C,-2.736995,1.522562,0.178009,-0.626805,-2.558986
D,-0.391089,1.743477,1.130018,0.897796,0.738929
E,0.330866,-1.063049,-0.125381,-0.945588,0.205485


**Removing columns:**

In [70]:
df.drop('new',axis=1)   # axis = 1 means it is dropping a column

Unnamed: 0,W,X,Y,Z
A,-0.045676,0.012421,0.093628,1.240813
B,-1.097693,-1.908009,-0.380104,-1.666059
C,-2.736995,1.522562,0.178009,-0.626805
D,-0.391089,1.743477,1.130018,0.897796
E,0.330866,-1.063049,-0.125381,-0.945588


In [71]:
# Not in-place unless specified!  (this means that the "new" column has not been removed in memory). So df is unchanged
df

Unnamed: 0,W,X,Y,Z,new
A,-0.045676,0.012421,0.093628,1.240813,0.047952
B,-1.097693,-1.908009,-0.380104,-1.666059,-1.477797
C,-2.736995,1.522562,0.178009,-0.626805,-2.558986
D,-0.391089,1.743477,1.130018,0.897796,0.738929
E,0.330866,-1.063049,-0.125381,-0.945588,0.205485


In [72]:
df.drop('new',axis=1,inplace=True)    #this specification of inplace=True now drops the "new" column from df in memory
df

Unnamed: 0,W,X,Y,Z
A,-0.045676,0.012421,0.093628,1.240813
B,-1.097693,-1.908009,-0.380104,-1.666059
C,-2.736995,1.522562,0.178009,-0.626805
D,-0.391089,1.743477,1.130018,0.897796
E,0.330866,-1.063049,-0.125381,-0.945588
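An alternative to `inplace=True` is to assign the result of drop to a name. The sketch below uses a small made-up frame (`df_demo`, not the lab's `df`); the `columns=` keyword is equivalent to passing the label with `axis=1`.

```python
import pandas as pd

df_demo = pd.DataFrame({"W": [1.0, 2.0], "Y": [0.5, 1.5]}, index=["A", "B"])
df_demo["new"] = df_demo["W"] + df_demo["Y"]

trimmed = df_demo.drop(columns="new")   # returns a new DataFrame
print(list(df_demo.columns))            # the original still has 'new'
print(list(trimmed.columns))
```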


In [73]:
# You can also drop rows this way
df.drop('E',axis=0)

Unnamed: 0,W,X,Y,Z
A,-0.045676,0.012421,0.093628,1.240813
B,-1.097693,-1.908009,-0.380104,-1.666059
C,-2.736995,1.522562,0.178009,-0.626805
D,-0.391089,1.743477,1.130018,0.897796


In [74]:
df

Unnamed: 0,W,X,Y,Z
A,-0.045676,0.012421,0.093628,1.240813
B,-1.097693,-1.908009,-0.380104,-1.666059
C,-2.736995,1.522562,0.178009,-0.626805
D,-0.391089,1.743477,1.130018,0.897796
E,0.330866,-1.063049,-0.125381,-0.945588


In [75]:
df.drop('E',axis=0,inplace=True)
df

Unnamed: 0,W,X,Y,Z
A,-0.045676,0.012421,0.093628,1.240813
B,-1.097693,-1.908009,-0.380104,-1.666059
C,-2.736995,1.522562,0.178009,-0.626805
D,-0.391089,1.743477,1.130018,0.897796


**Selecting rows:**

In [76]:
df.loc['A']  #this will extract the first row, which corresponds to index A

W   -0.045676
X    0.012421
Y    0.093628
Z    1.240813
Name: A, dtype: float64

**Or select based on position instead of label:**

In [77]:
df.iloc[2]   #this will extract the 3rd row, which corresponds to index C

W   -2.736995
X    1.522562
Y    0.178009
Z   -0.626805
Name: C, dtype: float64

**Selecting a subset of rows and columns:**

In [78]:
df.loc['B','Y']  #will return data at row B, column Y

-0.3801035020513599

In [80]:
df.loc[['A','B'],['W','Y']]  #will return data in rows A and B, and columns W and Y
#df.iloc[[0,1],[0,2]]

Unnamed: 0,W,Y
A,-0.045676,0.093628
B,-1.097693,-0.380104
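The label-based and position-based selectors can be compared side by side. The sketch below builds a small integer frame (`demo`, an illustrative stand-in for `df`) so the results are easy to check by eye.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=list("ABCD"), columns=list("WXYZ"))

by_label = demo.loc[["A", "B"], ["W", "Y"]]     # select by labels
by_position = demo.iloc[[0, 1], [0, 2]]         # same cells by position
print(by_label.equals(by_position))

# Label slices include the end label; position slices exclude the end
print(demo.loc["A":"C", "W"].tolist())
print(demo.iloc[0:2, 0].tolist())
```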


In [87]:
t = np.arange(0,10,0.1) #this creates evenly spaced numbers from 0 up to (but not including) 10, in 0.1 increments
x = np.sin(t)
y = np.cos(t)
df = pd.DataFrame({'Time':t, 'x':x, 'y':y}) #This creates a dataframe with three columns (Time, x, and y); the dictionary keys become the column header names
df.head(5) #This shows the first n rows of the dataframe; if left blank it defaults to 5 rows

Unnamed: 0,Time,x,y
0,0.0,0.0,1.0
1,0.1,0.099833,0.995004
2,0.2,0.198669,0.980067
3,0.3,0.29552,0.955336
4,0.4,0.389418,0.921061


In [88]:
df.tail() #This displays the last n rows of the dataframe (5 by default)

Unnamed: 0,Time,x,y
95,9.5,-0.075151,-0.997172
96,9.6,-0.174327,-0.984688
97,9.7,-0.271761,-0.962365
98,9.8,-0.366479,-0.930426
99,9.9,-0.457536,-0.889191


In [89]:
df['Time'][1:3] #It works from left to right. After selecting the Time column, select only the 2nd and 3rd rows.
#df.Time[1:3]  #Attribute access also works, but it is advisable to avoid it: it fails for column names that contain spaces or clash with DataFrame method names.

1    0.1
2    0.2
Name: Time, dtype: float64

In [90]:
df['Time'][-5:] #Negative slices work too: this selects the last five rows. The slicing rules you learned for lists apply here.

95    9.5
96    9.6
97    9.7
98    9.8
99    9.9
Name: Time, dtype: float64

In [91]:
data_sub = df[['Time', 'y']] #create a subset of only two columns of data out of the dataframe using title as index
data_sub.head()

Unnamed: 0,Time,y
0,0.0,1.0
1,0.1,0.995004
2,0.2,0.980067
3,0.3,0.955336
4,0.4,0.921061


In [92]:
df[['Time', 'y']][4:10] #Indexing both columns and rows at the same time: columns first, then rows. This is BACKWARDS compared to numpy arrays, which use [rows, columns]!

Unnamed: 0,Time,y
4,0.4,0.921061
5,0.5,0.877583
6,0.6,0.825336
7,0.7,0.764842
8,0.8,0.696707
9,0.9,0.62161
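The same subset can be taken in a single step with iloc or loc, which use the numpy-like order of rows first and columns second. A sketch that rebuilds the frame under the illustrative name `df_demo` and checks that all three selections agree:

```python
import numpy as np
import pandas as pd

t = np.arange(0, 10, 0.1)
df_demo = pd.DataFrame({"Time": t, "x": np.sin(t), "y": np.cos(t)})

chained = df_demo[["Time", "y"]][4:10]          # columns first, then rows
by_position = df_demo.iloc[4:10, [0, 2]]        # rows first, then columns
by_label = df_demo.loc[4:9, ["Time", "y"]]      # label slice includes 9

print(chained.equals(by_position), chained.equals(by_label))
```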


In [None]:
df.values[1:6] #use values if you want the raw numpy array data rather than the dataframe format.
#This is typically used to convert Pandas DataFrames into Numpy arrays.
#Here it gives the data in rows 1 to 5 (remember that row 1 is the 2nd row of data)
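pandas also provides the `to_numpy()` method for the same conversion. A short sketch, again rebuilding the frame under the illustrative name `df_demo`:

```python
import numpy as np
import pandas as pd

t = np.arange(0, 10, 0.1)
df_demo = pd.DataFrame({"Time": t, "x": np.sin(t), "y": np.cos(t)})

arr = df_demo.to_numpy()        # whole frame as a 2-D numpy array
print(type(arr), arr.shape)
print(np.array_equal(arr[1:6], df_demo.values[1:6]))
```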

In [93]:
df.describe()  #This function gives you useful descriptive statistics on the dataframe

Unnamed: 0,Time,x,y
count,100.0,100.0,100.0
mean,4.95,0.186474,-0.045161
std,2.901149,0.667424,0.726266
min,0.0,-0.999923,-0.999693
25%,2.475,-0.368329,-0.793512
50%,4.95,0.31532,-0.079077
75%,7.425,0.800989,0.687587
max,9.9,0.999574,1.0
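Each row of the describe() table can also be computed on its own. The sketch below does this for the Time column of a rebuilt frame (`df_demo` and `summary` are illustrative names) and can be compared against the table above.

```python
import numpy as np
import pandas as pd

t = np.arange(0, 10, 0.1)
df_demo = pd.DataFrame({"Time": t, "x": np.sin(t), "y": np.cos(t)})

summary = df_demo.describe()
print(df_demo["Time"].mean())            # the "mean" row
print(df_demo["Time"].std())             # the "std" row (sample standard deviation)
print(df_demo["Time"].quantile(0.25))    # the "25%" row
```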


<a id='input'></a>
## Interactive Input
You can use the input() function to read input interactively. The input is returned as a string, which can be converted into other data types as needed.

In [None]:
numStr = input("Enter an integer: ")
num = int(numStr)
num2 = int(input("Enter an integer: "))

print(num,num2)
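Calling int() on text that is not a whole number raises a ValueError, so it can help to wrap the conversion. A hedged sketch using a hypothetical helper `parse_int` (not part of the lab); the input() call is left as a comment so the example runs without prompting:

```python
def parse_int(text):
    """Return int(text), or None if text is not a valid integer."""
    try:
        return int(text.strip())
    except ValueError:
        return None

# In a notebook this would typically be called as:
#     num = parse_int(input("Enter an integer: "))
print(parse_int("42"), parse_int(" 7 "), parse_int("4.5"))
```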

<a id='grouping'></a>
## Group by
The groupby method allows you to group rows of data together and call aggregate functions on each group.

In [None]:
import pandas as pd
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
       'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
df

* Now you can use the .groupby() method to group rows together based on a column name. 
* As an example, we can group by Company. This will create a DataFrameGroupBy object.

In [None]:
df.groupby('Company')

You can save this object as a new variable:

In [None]:
by_comp = df.groupby("Company")

And then call aggregate methods off the object:

In [None]:
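by_comp.mean(numeric_only=True)   # numeric_only=True skips the text column 'Person'; pandas 2.0 and later raise a TypeError without it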

In [None]:
#This can all be done in one line of Pythonic code:
df.groupby('Company').mean(numeric_only=True)

More examples of aggregate methods:

In [None]:
by_comp.std(numeric_only=True)   # as with mean(), restrict the calculation to numeric columns

In [None]:
by_comp.min()

In [None]:
by_comp.describe()

In [None]:
by_comp.describe().transpose()

In [None]:
by_comp.describe().transpose()['GOOG']
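Another way to work only with numeric data is to select the column before aggregating, and agg() can then compute several statistics at once. A sketch that rebuilds the sales data under the illustrative names `df_sales` and `sales_stats`:

```python
import pandas as pd

data = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
        'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
        'Sales': [200, 120, 340, 124, 243, 350]}
df_sales = pd.DataFrame(data)

# Select the numeric column first, then compute several aggregates at once
sales_stats = df_sales.groupby('Company')['Sales'].agg(['mean', 'min', 'max'])
print(sales_stats)
```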

In [None]:
data = pd.DataFrame({'Well Type':['o','o','g','o','g','g','g','o'],
                    'Well Depth':[3500,2800,3000,3233,3010,5500,3600,4840]})
data.head()

In [None]:
grouped = data.groupby('Well Type')
print(grouped.describe())

In [None]:
df_oil = grouped.get_group('o')
values_oil = df_oil.values
print(values_oil)

In [None]:
type(grouped)
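A grouped object can also be aggregated one column at a time or looped over directly. The sketch below rebuilds the well data under the illustrative names `wells` and `grouped_wells`:

```python
import pandas as pd

wells = pd.DataFrame({'Well Type': ['o', 'o', 'g', 'o', 'g', 'g', 'g', 'o'],
                      'Well Depth': [3500, 2800, 3000, 3233, 3010, 5500, 3600, 4840]})
grouped_wells = wells.groupby('Well Type')

mean_depth = grouped_wells['Well Depth'].mean()   # one mean per well type
print(mean_depth)

# Iterating over a GroupBy object yields (group label, sub-DataFrame) pairs
for well_type, sub_df in grouped_wells:
    print(well_type, len(sub_df))
```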