# Welcome to Python: An Introduction to Data Science

Please sign in if you have not already. We use this data to improve our workshops!

### About UF DSI

We are a multi- and interdisciplinary student organization dedicated to promoting Data Science here at the University of Florida. We are partnered with the UF Informatics Institute, whose aim is to foster informatics research and education.

### What is Python?

Python is an easy-to-use and robust **Object-Oriented** programming language. A lot of new software applications are built with Python for this reason. It is also used in other areas of computer science such as software engineering, digital arts, cybersecurity, and of course Data Science!

This workshop introduces you to the basics of Python and to Data Science and Visualization in Python. Due to the breadth of the language, many topics are left for you to explore; here we teach the skills you need to get started.


## Variables and Types

#### Calculator

Python can be used as a calculator. <code>Shift+Enter</code> runs the current code block, so you don't have to click Run every time.

In [1]:
# Addition and Subtraction


In [2]:
# Multiplication and Division


In [3]:
# Exponentiation
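
A minimal sketch of what the calculator cells above might contain. The numbers are arbitrary examples rather than values from the workshop, and the note on division assumes Python 3.

```python
# Addition and subtraction
total = 7 + 3 - 4           # 6

# Multiplication and division ("/" always yields a float in Python 3)
product = 6 * 7             # 42
quotient = 7 / 2            # 3.5
floor_quotient = 7 // 2     # 3 (floor division)

# Exponentiation uses **, not ^
power = 2 ** 10             # 1024

print(total, product, quotient, floor_quotient, power)
```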


Variables can be given alphanumeric names beginning with an underscore or letter.  Variable types do not have to be declared and are inferred at run time.

In [4]:
# Built in function

Strings can be declared with either single or double quotes.

## Modules and Import
Files with a .py extension are known as Modules in Python.  Modules are used to store functions, variables, and class definitions.  

Modules, including those that ship with the standard Python library such as <code>math</code>, are made available in your program using the <code>import</code> statement.

In [5]:
# To use Math, we must import it

Whoops.  Importing the <code>math</code> module gives us access to all of its functions, but we must call them through the module name, for example <code>math.cos</code>.

Alternatively, you can use the <code>from</code> keyword

In [6]:
# we only imported cos, not the pi constant

Using the <code>from</code> statement we can import everything from the math module.  

Disclaimer: many Pythonistas discourage doing this, mainly because it clutters your namespace and can silently overwrite names you already use.  Just import what you need.

In [7]:
 # now we don't have to make a call to math
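
The import styles discussed in this section can be sketched as follows. The choice of <code>cos</code> and <code>pi</code> is only an example; the star import is left out because it is the form being discouraged.

```python
# Style 1: import the module and reach its members through the module name
import math
via_module = math.cos(math.pi)

# Style 2: import selected names so they can be used without the "math." prefix
from math import cos, pi
via_names = cos(pi)

print(via_module, via_names)
```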

## Strings
As you may expect, Python has a powerful, full-featured string type.  

### Substrings
Substrings can be extracted from Python strings using bracket (slice) syntax.

Python is a 0-index based language.  Generally whenever forming a range of values in Python, the first argument is inclusive whereas the second is not, i.e. <code>mystring[11:25]</code> returns characters 11 through 24.

You can omit the first or second argument

In [8]:
# all characters before the 9th index

In [9]:
# all characters at or after the 27th

In [10]:
 # you can even omit both arguments

Using negative values, you can count positions backwards
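
A short sketch of the slicing rules described above. The string <code>mystring</code> is a made-up example, not necessarily the one used in the workshop.

```python
mystring = "Welcome to the University of Florida"

# Start is inclusive, end is exclusive: characters 11 through 24
middle = mystring[11:25]

# Omit the first argument to start from the beginning
start = mystring[:7]

# Omit the second argument to run to the end
end = mystring[29:]

# Omit both to get the whole string
everything = mystring[:]

# Negative values count from the end of the string
last_seven = mystring[-7:]

print(middle, start, end, last_seven, sep=" | ")
```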

### String Functions
Here are some more useful string functions
#### find

In [11]:
# returns the index of the first occurrence of Gators

Looks like nothing was found.  -1 is returned by default.

In [12]:
# no Seminoles here

#### lower and upper

#### split

In [13]:
 # returns a list of strings broken by a space by default

In [14]:
 # you can also define the separator

#### join

The <code>join</code> method is useful for building strings from lists or other iterables.  Call <code>join</code> on the desired separator.
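
The string functions covered in this section might be used as in the sketch below. The sample strings are invented for illustration.

```python
sentence = "Go Gators beat the heat"

# find returns the index of the first match, or -1 when nothing is found
found = sentence.find("Gators")
missing = sentence.find("Seminoles")

# lower and upper return new strings; the original is unchanged
lowered = sentence.lower()
uppered = sentence.upper()

# split breaks on whitespace by default, or on a separator you supply
words = sentence.split()
date_parts = "2019-01-15".split("-")

# join is called on the separator and given the pieces to combine
rejoined = "_".join(words)

print(found, missing, words, date_parts, rejoined)
```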

For more information on string functions:

https://docs.python.org/2/library/stdtypes.html#string-methods

## Lists
The Python standard library does not have traditional C-style fixed-memory fixed-type arrays.  Instead, lists are used and can contain a mix of any type.

Lists are created with square brackets []

In [None]:
mylist = [1, 2, 3, 4, 'five']
print(mylist)

In [None]:
# add an item to the end of the list
print(mylist)

In [None]:
 # insert the number 7 at index 6
print(mylist)

In [None]:
 # removes the first matching occurrence
print(mylist)

In [None]:
# by default, the last item in the list is removed and returned
popped = mylist.pop()
print(popped)
print(mylist)

In [None]:
# returns the length of any iterable such as lists and strings

In [None]:
# default list sorting. When more complex objects are in the list, arguments can be used to customize how to sort

print(mylist)

In [None]:
# reverse the list
print(mylist)
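
One possible way to fill in the list cells above, shown as a single self-contained sketch. The appended and removed values are assumptions chosen so that each step works; the workshop may use different ones.

```python
mylist = [1, 2, 3, 4, 'five']

mylist.append(6)          # add an item to the end of the list
mylist.insert(6, 7)       # insert the number 7 at index 6
mylist.remove('five')     # remove the first matching occurrence
popped = mylist.pop()     # remove and return the last item
length = len(mylist)      # works on any sized iterable, such as lists and strings

mylist.sort()             # in-place ascending sort (items must be comparable)
mylist.reverse()          # in-place reversal

print(popped, length, mylist)
```

Note that sorting is done only after <code>'five'</code> is removed: in Python 3, a list mixing integers and strings cannot be sorted with the default ordering.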

For more information on Lists:

https://docs.python.org/2/tutorial/datastructures.html#more-on-lists

## Conditionals
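Python supports the standard if / else-if / else conditional statement, written with <code>if</code>, <code>elif</code>, and <code>else</code>. Remember to indent: the indentation defines which statements belong to each branch.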

In [None]:
a = 1; b = 2

if a > b:
    print("a is greater than b")
elif a < b:
    print("a is less than b")
else:
    print("a is equal to b")

## Loops
Python supports counting loops and foreach-style loops (both written with <code>for</code>), as well as <code>while</code> loops
### For (counting)
Traditional counting loops are accomplished in Python with a combination of the <code>for</code> keyword and the <code>range</code> function

In [None]:
 # with one argument, range produces integers from 0 to 9


In [None]:
# with three arguments, range starts at 1 and goes in steps of 3 until greater than 12


### Foreach
As it turns out, counting loops are just foreach loops in Python.  The <code>range</code> function produces a sequence of integers (in Python 3, a lazy <code>range</code> object rather than a list) over which <code>for in</code> iterates.  This can be extended to any other iterable type.

In [None]:
# iterate over a list of strings
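
The loop forms described in this section can be sketched as below. The list of colleges and the loop bounds are hypothetical examples.

```python
# Counting loop: with one argument, range(10) yields 0 through 9
for i in range(10):
    print(i)

# With three arguments: start at 1, step by 3, stop before 12
stepped = list(range(1, 12, 3))
print(stepped)

# Foreach-style loop over any iterable, here a list of strings
colleges = ["Engineering", "Medicine", "Law"]
for college in colleges:
    print(college)

# while loop: repeats as long as the condition holds
countdown = 3
while countdown > 0:
    print(countdown)
    countdown -= 1
```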


## Functions
Functions in Python do not have a distinction between those that do and do not return a value.  If a value is returned, the type is not declared.

Functions can be declared in any module without any distinction between static and non-static.  Functions can even be declared within other functions

The syntax is the following

In [None]:
 # use some arguments 
    # cast number to a string when concatenating
    


Functions can have optional arguments if a default value is provided in the function signature

In [None]:
# optional team argument

    
# no team argument supplied

In [None]:
# supplying all three arguments

Python functions can be called using named arguments, instead of positional
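
A minimal sketch of a function with an optional argument, called positionally and with named arguments. The function <code>describe_player</code> and its parameters are hypothetical, loosely following the comments in the cells above.

```python
def describe_player(name, number, team="Gators"):
    # cast number to a string when concatenating
    return name + " wears number " + str(number) + " for the " + team

# no team argument supplied, so the default is used
default_team = describe_player("Albert", 1)

# supplying all three arguments
explicit_team = describe_player("Alberta", 2, "Florida Gators")

# named arguments can be given in any order
named = describe_player(number=3, name="Al")

print(default_team)
print(explicit_team)
print(named)
```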

### return
In Python functions, an arbitrary number of values can be returned

In [None]:
# return a single value
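
To illustrate returning one value versus several, here is a small sketch with two hypothetical functions. When several values are returned, Python packs them into a tuple that the caller can unpack.

```python
def square(x):
    # return a single value
    return x * x

def min_and_max(values):
    # return two values; they are packed into a tuple
    return min(values), max(values)

squared = square(4)
lowest, highest = min_and_max([3, 1, 4, 1, 5])   # unpack the tuple

print(squared, lowest, highest)
```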



# Data Science Tutorial
Now that we've covered some Python basics, we will begin a tutorial going through many tasks a data scientist may perform.  We will obtain real world data and go through the process of auditing, analyzing, visualizing, and building classifiers from the data.

We will use a dataset of client demographics for a medical transportation program, which can be found using this link: 
https://data.austintexas.gov/Health-and-Community-Services/Client-Demographics-for-the-Medical-Transportation/6jna-snvk

This data covers the demographics of clients of the Medical Transportation Program funded by the Ryan White grants.

## Obtaining the Data
Using the pandas library we can easily import data from a given link or from a file on our computer (you need to know the file path). In this case we read a local copy of the data saved as <code>data.csv</code>.

In [28]:
import pandas as pd # import the module and alias it as pd

data = pd.read_csv('data.csv')
data.head() # show the first few rows of the data

| Unnamed: 0 | Client ID | Age Range | Gender | Education Level | Insurance | Race | Hispanic Ethnicity | Language Primary | Living Situation | 31 - Day Metro Pass | ... | Disability Fare Card | Gas Voucher | Metro ACCESS book | STS ticket | Taxi Voucher | Van/Car Ride | Agency | Grant Year | Grant Dates | Grant Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10015 | 61 - 70 | Female | Some college education | Medi-Cal/Medicaid | White | False | English | Rental housing | | ... | | | | | | 43.0 | Community Action | 2013 | 1/1/2013-12/31/2013 | Ryan White Part C |
| 1 | 10015 | 61 - 70 | Female | Some college education | Medi-Cal/Medicaid | White | False | English | Rental housing | | ... | | | | | | 52.0 | Community Action | 2014 | 1/1/2014-12/31/2014 | Ryan White Part C |
| 2 | 10015 | 61 - 70 | Female | Some college education | Medi-Cal/Medicaid | White | False | English | Rental housing | | ... | | 6.0 | | | | 36.0 | Community Action | 2015 | 1/1/2015-12/31/2015 | Ryan White Part C |
| 3 | 10015 | 61 - 70 | Female | Some college education | Medi-Cal/Medicaid | White | False | English | Rental housing | | ... | | 9.0 | | | | 21.0 | Community Action | 2015 | 1/1/2015-12/31/2015 | Ryan White Part C |
| 4 | 10015 | 61 - 70 | Female | Some college education | Medi-Cal/Medicaid | White | False | English | Rental housing | | ... | | 9.0 | | | | 21.0 | Community Action | 2016 | 1/1/2016-12/31/2016 | Ryan White Part C |


## Accessing your data

In [15]:
# Better way using loc
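
A rough sketch of label-based access with <code>loc</code>. It uses a tiny stand-in table named <code>df</code> whose rows are invented for illustration; only the column names are borrowed from the workshop data.

```python
import pandas as pd

# Tiny stand-in table; the rows are made up, not taken from the real data
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "Gas Voucher": [6.0, None, 9.0],
    "Grant Year": [2015, 2015, 2016],
})

# All rows of one column, returned as a Series
vouchers = df.loc[:, "Gas Voucher"]

# A single cell: row label 2, column "Gas Voucher"
cell = df.loc[2, "Gas Voucher"]

# Rows matching a condition, restricted to selected columns
recent = df.loc[df["Grant Year"] == 2016, ["Gender", "Grant Year"]]

print(vouchers)
print(cell)
print(recent)
```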


#### Looking at all the columns

### Check Unique Values and Types

array([ nan,   6.,   9.,   5.,  12.,   3.,  21.,  14.,   7.,  10.,   2.,
        22.,  27.,   4.,  11.,  50.,  51.,  53.,  15.,   1.,   8.,  13.,
        45.,  24.,  38.,  16.,  25.,  17.,  34.,  48.,  58.,  39.,  60.,
        28.,  40.,  19.,  29.,  36.,  18.,  20.,  26.,  35.,  31.,  46.,
        54.,  65.,  71.,  66.,  72., 101.,  41.,  78.,  32.,  44.])

In [16]:
# quick way to look at all the types
for x in data.columns:
    print(x + "\t\t\t" + str(type(data.loc[:,x].unique()[0])))

#### Accessing Data using Index

Structurally, pandas DataFrames are a collection of Series objects sharing a common index.  In general, the Series object and DataFrame object share a large number of functions with some behavioral differences.  In other words, whatever computation you can do on a single column can generally be applied to the entire DataFrame.

Now we can use the DataFrame version of <code>describe</code> to get an overview of all of our data
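
The inspection steps in this section (unique values, column types, and <code>describe</code>) might look like the sketch below. The table <code>df</code> is a small invented stand-in, not the workshop data.

```python
import pandas as pd

# Invented stand-in rows; only the column names mirror the workshop data
df = pd.DataFrame({
    "Insurance": ["Medicaid", "Medicare", "Medicaid"],
    "Gas Voucher": [6.0, None, 9.0],
    "Grant Year": [2015, 2015, 2016],
})

# Distinct values in one column (missing values show up as nan)
unique_vouchers = df.loc[:, "Gas Voucher"].unique()

# Storage type of every column at once
column_types = df.dtypes

# Summary statistics; by default only numeric columns are included
summary = df.describe()

print(unique_vouchers)
print(column_types)
print(summary)
```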

## Visualizing the Data and EDA
Another important tool in the data scientist's toolbox is the ability to create visualizations from data.  Visualizing data is often the most logical place to start getting a deeper intuition of the data.  This intuition will shape and drive your analysis.

Even more important than visualizing data for your own personal benefit, it is often the job of the data scientist to use the data to tell a story.  Creating illustrative visuals that succinctly convey an idea is the best way to tell that story, especially to stakeholders with less technical skillsets.

We'll be using the plotting library matplotlib, whose interface was modeled on MATLAB's plotting commands. It is the most widely used Python plotting library, and a few other packages are built on top of it (like a library called seaborn) to make your plots more attractive and easier to create.

We'll start by doing a bit of setup.

In [77]:
#importing matplotlib library with an alias as well as the seaborn library
import matplotlib.pyplot as plt 
import seaborn as sns

sns.set(style = 'darkgrid', color_codes = True)   # my personal style preferences

# hack to make seaborn plots bigger on jupyter notebooks
def setPlt(x = 25, y = 15):
    f, ax = plt.subplots(figsize = (x,y))
    sns.despine(f, left = True, bottom = True)

If the above does not work, please follow these steps:

Mac

- Open Terminal, type <code>conda remove seaborn</code>, and press Enter.
- Then type <code>conda install seaborn==0.9.0</code>

Windows

- Open Anaconda Prompt (press the Windows button and type "Anaconda Prompt")
- Type <code>conda remove seaborn</code>
- Then type <code>conda install seaborn==0.9.0</code>

You may need to restart Anaconda.

#### Let's look at the amount of gas vouchers given out

In [17]:
# create our first plot, a count plot of gas vouchers

setPlt(25,10)

# two ways to do this

# hist = sns.countplot(data.loc[:,'Gas Voucher'])
# hist = sns.countplot(x = 'Gas Voucher', data = data)

Visualization is all about asking questions of the data. One thing that we could be curious about is the distribution of genders.

In [18]:
setPlt()
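
One way to look at the distribution of genders is a count plot, sketched below on an invented stand-in table <code>df</code>. The figure size is arbitrary, and the non-interactive backend is selected only so the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented rows for illustration only
df = pd.DataFrame({"Gender": ["Female", "Male", "Female", "Female"]})

# The same counts the plot will display, as a table
counts = df["Gender"].value_counts()
print(counts)

# One bar per gender, with bar height equal to the number of rows
fig, ax = plt.subplots(figsize=(8, 5))
sns.countplot(x="Gender", data=df, ax=ax)
```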


Does insurance have anything to do with the amount of gas vouchers given out?

In [19]:
setPlt(15,9)
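
As a first pass at this question, one could compare the average number of gas vouchers per insurance type. This is a sketch on invented rows, and grouping by the mean is just one of several reasonable choices; it is not necessarily the plot used in the workshop.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; not needed inside Jupyter
import pandas as pd

# Invented rows for illustration only
df = pd.DataFrame({
    "Insurance": ["Medicaid", "Medicaid", "Medicare", "Private"],
    "Gas Voucher": [6.0, 9.0, 3.0, 12.0],
})

# Average vouchers for each insurance type
means = df.groupby("Insurance")["Gas Voucher"].mean()
print(means)

# Bar chart of the averages using the pandas plotting wrapper around matplotlib
ax = means.plot(kind="bar")
ax.set_ylabel("Mean gas vouchers")
```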


Looking at who has the highest amount of gas vouchers

Compare different benefits?

In [20]:
#colored scatter plot
setPlt(13,9)


There is so much more that you can learn from this data and we urge you to practice the skills that you have learned by doing more exploration on this data set. 

# Thank You!