# Intro to Python workshop

## Afsaneh Towhidi (Sunny)


### References:

1. http://cs231n.github.io/python-numpy-tutorial/
2. Problem Solving with Algorithms and Data Structures using Python by Bradley N. Miller and David L. Ranum
3. http://www.hedaro.com/pandas-tutorial

## Python
- An interpreted high-level programming language.
- A modern, easy-to-learn, object-oriented programming language.
- Has a powerful set of built-in data types and easy-to-use control constructs.
- Easiest to review by looking at and describing interactive sessions.

In [None]:
print('Hello world!')

Python code is often said to be almost like pseudocode, since it allows you to express very powerful ideas in very few lines of code while being very readable. 

## Some facts about Python
- Python supports the object-oriented programming paradigm
- Python considers data to be the focal point of the problem-solving process
- We define a class to be a description of
    - The state : What the data look like
    - The behavior: What the data can do
- A user of a class only sees the state and behavior of a data item, so classes are analogous to abstract data types
- Data items are called objects
- An object is an instance of a class

### Python versions
There are currently two different supported versions of Python, 2.7 and 3.6. Somewhat confusingly, Python 3.0 introduced many backwards-incompatible changes to the language, so code written for 2.7 may not work under 3.6 and vice versa. For this workshop all code will use Python 3.6.

## Anaconda
The open source Anaconda Distribution is the easiest way to do Python data science and machine learning. It includes 250+ popular data science packages and the conda package and virtual environment manager for Windows, Linux, and MacOS. 

Conda makes it quick and easy to install, run, and upgrade complex data science and machine learning environments like Scikit-learn, TensorFlow, and SciPy.

## Built-in Atomic Data Types
- Atomic data are data elements that represent the lowest level of detail.
- Python has two main built-in numeric classes that implement the integer and floating point data types $=>$ $int$ and $float$ classes
- The standard arithmetic operations:
    - +, -, *, /, and $**$ (exponentiation)
    - / when two integers are divided, the result is a floating point
-  Other very useful operations
    - remainder (modulo) operator %
    - integer division //   : returns the quotient rounded down to the nearest integer (for positive operands this is the same as dropping the fractional part)
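
A short sketch of how the division-related operators behave. The negative operand is an illustration added here; it is the case where rounding down and dropping the fractional part give different answers.

```python
# True division always returns a float, even when the result is whole.
print(6 / 3)     # 2.0

# Floor division rounds toward negative infinity.
print(7 // 3)    # 2
print(-7 // 3)   # -3

# The remainder takes the sign of the divisor.
print(7 % 3)     # 1
print(-7 % 3)    # 2

# The two operators are tied together by this identity.
a, b = -7, 3
identity_holds = (a // b) * b + a % b == a
print(identity_holds)   # True
```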


### Basic data types

### Numbers

In [None]:
x = 3
print (x)
type(x)

In [None]:
print(2+3*4)
print((2+3)*4)
print(2**10)
print(6/3)
print(7/3)
print(7//3)
print(7%3) # remainder (modulo) operator %
print(3/6)
print(3//6) # integer division // : returns the quotient rounded down to the nearest integer
print(3%6)
print(2**100)

In [None]:
x

In [None]:
x += 1
print (x) 
x *= 2
print (x)

Note that unlike many languages, Python does not have unary increment (x++) or decrement (x--) operators.

In [None]:
## test float numbers



### Boolean
- The boolean data type $=> bool$ class
- The possible state values for a boolean object are $True$ and $False$.
- The standard boolean operators are $and$, $or$, and $not$, written as words rather than symbols (&&, ||, etc.).

In [None]:
print(True)
print(False)
print(False or True) # Logical OR;
print(not (False or True)) # Logical NOT;
print(True and True) # Logical AND;
print (True != False)  # Logical XOR;

In [None]:
# assign 2 to variable 'a'

# assign 3 to variable 'b'

# assign 2 to variable 'c'

# check if c is equal to either a or b

# check if a, b, c are all equal.


### Strings
- Sequential collections of zero or more letters, numbers and other symbols.
- We call these letters, numbers and other symbols characters.

In [None]:
h = 'hello'   # String literals can use single quotes
w = "world"   # or double quotes; it does not matter.
print (h, len(h))

####  String concatenation

In [None]:
hw = h + ' ' + w
print (hw)

String objects have a bunch of useful methods; for example:

In [None]:
s = "hello"
print (s.capitalize())  # Capitalize a string
print (s.upper())       # Convert a string to uppercase
print (s.rjust(7))     # Right-justify a string, padding with spaces
print (s.center(7))     # Center a string, padding with spaces; prints " hello "
print (s.replace('l', '(ell)'))  # Replace all instances of one substring with another;
print ('  world '.strip()) # Strip leading and trailing whitespace

## Containers

### Lists
- A list is an ordered collection of zero or more references to Python data objects.
- An empty list: []
- In python's list, data objects need not all be from the same class and the collection can be assigned to a variable.


In [None]:
print([1,3,True,6.5])
#  When python evaluates a list, the list itself is returned

myList = [1,3,True,6.5]
# in order to remember the list for later processing, its reference needs to be assigned to a variable.
print(myList)

In [None]:
print(myList[3])

In [None]:
myList.append(4.5)
print(myList)

In [None]:
x = myList.pop()     # Remove and return the last element of the list
print (x, myList) 

In [None]:
# define a list containing these numbers: 3, 9, 2.5, 8.3, 12.2, 3 and assign it to variable newList


# add 14.5 to this list. Use print to check your code.



#reverse the list using reverse method


# use sort method to sort this list



#count the frequency of 3 in the list


#remove 9 from the list using remove method


#### slicing
In addition to accessing list elements one at a time, Python provides concise syntax to access sublists; this is known as slicing:

In [None]:
nums = [1,7,3,2,9]
print (nums)
print (nums[2:4])    # Get a slice from index 2 to 4 (exclusive)
print (nums[2:])     # Get a slice from index 2 to the end
print (nums[:2])     # Get a slice from the start to index 2 (exclusive)
print (nums[:])      # Get a slice of the whole list
print (nums[:-1])    # Slice indices can be negative
nums[2:4] = [8, 9] # Assign a new sublist to a slice
print (nums)

In [None]:
lst = ['a', 'b', 'd', 'r', 'e']

#print index 1

#print the last element of lst

# Get a slice from index 2 to 3

# Get a slice from index 3 to the end

#### Loops

You can loop over the elements of a list like this:

In [None]:
animals = ['cat', 'dog', 'monkey']
for animal in animals:
    print (animal)

If you want access to the index of each element within the body of a loop, use the built-in enumerate function:

In [None]:
animals = ['cat', 'dog', 'monkey']
for index, animal in enumerate(animals):
    print (index + 1, animal)

#### List comprehensions:

In [None]:
nums = [0, 1, 2, 3, 4]
squares = []
for x in nums:
    squares.append(x ** 2)
print (squares)

In [None]:
nums = [0, 1, 2, 3, 4]
squares = [x ** 2 for x in nums]
print (squares)

List comprehensions can also contain conditions:

In [None]:
nums = [0, 1, 2, 3, 4]
even_squares = [x ** 2 for x in nums if x % 2 == 0]
print (even_squares)

In [None]:
## an example of list comprehension

numbers = [1, 2, 3, 4, 5]

doubled_odds = []
for n in numbers:
    if n % 2 == 1:
        doubled_odds.append(n * 2)
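
The loop above can be written as a single list comprehension. This is a sketch of one equivalent form; the name `doubled_odds_lc` is introduced here only so the loop's result is not overwritten.

```python
numbers = [1, 2, 3, 4, 5]

# Keep only the odd numbers, then double each one.
doubled_odds_lc = [n * 2 for n in numbers if n % 2 == 1]
print(doubled_odds_lc)   # [2, 6, 10]
```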
        

### Dictionaries
- Collections of associated pairs of items where each pair consists of a key and a value.
- key:value
- Dictionaries are written as comma-delimited key:value pairs enclosed in curly braces.

In [None]:
capitals = {'Ontario':'Toronto','British columbia':'Victoria'}
print(capitals)

In [None]:
# We can manipulate a dictionary by accessing a value via its key or by adding another key-value pair.

print(capitals['Ontario'])

In [None]:
#adding a new key:value
capitals['Manitoba']='Winnipeg'
print(capitals)

In [None]:
# add Alberta and its capital Edmonton to this dictionary


In [None]:
capitals['Saskatchewan']  # Raises KeyError: this key is not in the dictionary

In [None]:
print (capitals.get('Saskatchewan', 'N/A'))  # Get an element with a default
print (capitals.get('Ontario', 'N/A'))    # Get an element with a default

It is easy to iterate over the keys in a dictionary:

In [None]:
d = {'person': 2, 'cat': 4, 'spider': 8}
for animal in d:
    legs = d[animal]
    print (f'A {animal} has {legs} legs')
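
If both the key and the value are needed, the `items` method of the dictionary yields them as pairs. A minimal sketch using the same data (the name `pairs` is added for illustration):

```python
d = {'person': 2, 'cat': 4, 'spider': 8}

# items() yields (key, value) pairs, so no second lookup is needed.
for animal, legs in d.items():
    print(f'A {animal} has {legs} legs')

# The pairs can also be collected into a list.
pairs = list(d.items())
print(pairs)
```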

Dictionary comprehensions work much like list comprehensions and let you build dictionaries concisely:

In [None]:
nums = [0, 1, 2, 3, 4]
even_num_to_square = {x: x ** 2 for x in nums if x % 2 == 0}
print (even_num_to_square)

### Sets
- Unordered collection of zero or more immutable Python data objects
- Sets do not allow duplicates and are written as comma-delimited values enclosed in curly braces
- The empty set is represented by set().
- Sets are heterogeneous
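
A brief sketch of the points in the list above: duplicates are dropped when a set is built, empty braces create a dictionary rather than a set, and one set can mix element types. The variable names are illustrative.

```python
# Duplicates collapse to a single element.
letters = {'a', 'b', 'a', 'c', 'b'}
print(len(letters))        # 3

# {} is an empty dictionary; use set() for an empty set.
empty_braces = {}
empty_set = set()
print(type(empty_braces))  # <class 'dict'>
print(type(empty_set))     # <class 'set'>

# Sets can mix element types, as long as each element is hashable.
mixed = {1, 'two', 3.0}
print(len(mixed))          # 3
```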

In [None]:
animals = {'cat', 'dog'}
print ('cat' in animals)   # Check if an element is in a set
print ('fish' in animals)
print(animals)

In [None]:
animals.add('fish')      # Add an element to a set
print ('fish' in animals)
print (len(animals))      # Number of elements in a set

In [None]:
animals.add('cat')     # Adding an element that is already in the set does nothing
print (animals)
animals.remove('cat')    # Remove an element from a set
print(animals)

### Tuples

- Tuples are used for grouping data. Each element or value that is inside of a tuple is called an item.
- A tuple is in many ways similar to a list; one of the most important differences is that tuples can be used as keys in dictionaries and as elements of sets, while lists cannot.
- Very similar to lists in that they are heterogeneous sequences of data.
- It is an immutable, or unchangeable, ordered sequence of elements. Because tuples are immutable, their values cannot be modified.
- Tuples are written as comma-delimited values enclosed in parentheses.
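
A small sketch of two properties described above: a tuple cannot be changed in place, and, unlike a list, it can serve as a dictionary key. The names `point` and `lookup` are made up for this example.

```python
point = (3, 4)

# Item assignment on a tuple raises TypeError.
try:
    point[0] = 10
except TypeError as err:
    print('tuples are immutable:', err)

# A tuple can be a dictionary key.
lookup = {point: 'treasure'}
print(lookup[(3, 4)])      # treasure

# A list cannot, because lists are mutable and therefore unhashable.
try:
    lookup[[3, 4]] = 'oops'
except TypeError as err:
    print('lists cannot be keys:', err)
```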

In [None]:
coral = ('blue coral', 'staghorn coral', 'pillar coral', 'elkhorn coral')

print(coral[2])

In [None]:
d = {(x, x + 1): x for x in range(10)} # Create a dictionary with tuple keys
print (d)

In [None]:
t = (5, 6)       # Create a tuple
print (type(t))
print (d[t])

In [None]:
print (d[(1, 2)])

## Functions

- A function definition requires a name, a group of parameters, and a body.

In [None]:
def sign(x):
    if x > 0:
        return 'positive'
    elif x < 0:
        return 'negative'
    else:
        return 'zero'

sign(-3)

In [None]:
for x in [-1, 0, 1]:
    print (sign(x))

We will often define functions to take optional keyword arguments, like this:

In [None]:
def hello(name, loud=False):
    if loud:
        print(f'HELLO, {name.upper()}')
    else:
        print(f'Hello, {name}')

hello('Bob')
hello('Fred', loud=True)

## Numpy

Numpy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays. If you are already familiar with MATLAB, you might find this tutorial useful to get started with Numpy:
https://docs.scipy.org/doc/numpy-dev/user/numpy-for-matlab-users.html

To use Numpy, we first need to import the numpy package:

In [None]:
import numpy as np

### Arrays

A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.

In [None]:
a = np.array([1, 2, 3])  # Create a rank 1 array
print ('type: ', type(a))
print('shape: ', a.shape)
print (a[0], a[1], a[2])
a[0] = 5                 # Change an element of the array
print (a)

In [None]:
b = np.array([[1,2,3],[4,5,6]])   # Create a rank 2 array
print (b)

In [None]:
print (b.shape)                   
print (b[0, 0], b[0, 1], b[1, 0])

Numpy also provides many functions to create arrays:

In [None]:
a = np.zeros((2,2))   # Create an array of all zeros
print(a)

In [None]:
b = np.ones((1,2))    # Create an array of all ones
print(b)

In [None]:
c = np.full((2,2), 7)  # Create a constant array
print(c)

In [None]:
d = np.eye(2)         # Create a 2x2 identity matrix
print(d)

In [None]:
e = np.random.random((2,2))  # Create an array filled with random values
print(e)

### Array indexing

Numpy offers several ways to index into arrays.

**Slicing:** Similar to Python lists, numpy arrays can be sliced. Since arrays may be multidimensional, you must specify a slice for each dimension of the array:

In [None]:
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print('a:\n ', a)

In [None]:
b = a[:2, 1:3]

# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):
print('\nb: \n', b) 

In [None]:
# A slice of an array is a view into the same data, so modifying it
# will modify the original array.
print('\na[0, 1]:\n',a[0, 1])

b[0, 0] = 77

print(a[0, 1])
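
If the original array should stay untouched, one option is to take an explicit copy of the slice. This is a sketch using `copy()`, which the cells above do not cover; the names `view` and `independent` are illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# A plain slice is a view: writing to it changes a.
view = a[:2, 1:3]
view[0, 0] = 77
print(a[0, 1])      # 77

# copy() gives independent data: writing to it leaves a alone.
independent = a[:2, 1:3].copy()
independent[0, 0] = -1
print(a[0, 1])      # still 77
```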

In [None]:
# # Create the following rank 2 array
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]



# print the shape

# print the second row and print its shape

# print the second column 


**Boolean array indexing:** Boolean array indexing lets you pick out arbitrary elements of an array. Frequently this type of indexing is used to select the elements of an array that satisfy some condition. Here is an example:

In [None]:
a = np.array([[1,2], [3, 4], [5, 6]])
bool_idx = (a > 2)
#Find the elements of a that are bigger than 2;
# this returns a numpy array of Booleans of the same
# shape as a, where each slot of bool_idx tells
# whether that element of a is > 2.

print(bool_idx)

In [None]:
# We use boolean array indexing to construct a rank 1 array
# consisting of the elements of a corresponding to the True values
# of bool_idx

print(a[bool_idx])

In [None]:
# We can do all of the above in a single concise statement:
print(a[a > 2])

### Datatypes
Every numpy array is a grid of elements of the same type. Numpy provides a large set of numeric datatypes that you can use to construct arrays. Numpy tries to guess a datatype when you create an array, but functions that construct arrays usually also include an optional argument to explicitly specify the datatype. Here is an example:

In [None]:
import numpy as np

x = np.array([1, 2])   # Let numpy choose the datatype
print(x.dtype)         # Prints "int64" on most platforms (the default integer type is platform-dependent)

x = np.array([1.0, 2.0])   # Let numpy choose the datatype
print(x.dtype)             # Prints "float64"

x = np.array([1, 2], dtype=np.int64)   # Force a particular datatype
print(x.dtype)      

### Arithmetic operations

In [None]:
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

# Elementwise sum; both produce the array
# [[ 6.0  8.0]
#  [10.0 12.0]]
print(x + y)
print(np.add(x, y))

In [None]:
# Elementwise difference; both produce the array
# [[-4.0 -4.0]
#  [-4.0 -4.0]]
print(x - y)
print(np.subtract(x, y))

In [None]:
# Elementwise product; both produce the array
# [[ 5.0 12.0]
#  [21.0 32.0]]
print(x * y)
print(np.multiply(x, y))

In [None]:
# Elementwise division; both produce the array
# [[ 0.2         0.33333333]
#  [ 0.42857143  0.5       ]]
print(x / y)
print(np.divide(x, y))

In [None]:
# Elementwise square root; produces the array
# [[ 1.          1.41421356]
#  [ 1.73205081  2.        ]]
print(np.sqrt(x))

Note that unlike MATLAB, * is elementwise multiplication, not matrix multiplication. We instead use the dot function to compute inner products of vectors, to multiply a vector by a matrix, and to multiply matrices. dot is available both as a function in the numpy module and as an instance method of array objects:

In [None]:
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])

v = np.array([9,10])
w = np.array([11, 12])

# Inner product of vectors; both produce 219
print(v.dot(w))
print(np.dot(v, w))

In [None]:
# Matrix / vector product; both produce the rank 1 array [29 67]
print(x.dot(v))
print(np.dot(x, v))

In [None]:
# Matrix / matrix product; both produce the rank 2 array
# [[19 22]
#  [43 50]]
print(x.dot(y))
print(np.dot(x, y))
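
Since Python 3.5, the `@` operator can also be used for matrix multiplication with numpy arrays. A short sketch with the same arrays, showing that it agrees with `dot` in these three cases:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
v = np.array([9, 10])
w = np.array([11, 12])

print(v @ w)    # 219
print(x @ v)    # [29 67]
print(x @ y)    # [[19 22]
                #  [43 50]]
```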

Numpy provides many useful functions for performing computations on arrays; one of the most useful is sum:

In [None]:
x = np.array([[1,2],[3,4]])

print(np.sum(x))  # Compute sum of all elements

In [None]:
print(np.sum(x, axis=0))  # Compute sum of each column

In [None]:
print(np.sum(x, axis=1))  # Compute sum of each row

To transpose a matrix, simply use the T attribute of an array object:

In [None]:
x = np.array([[1,2], [3,4]])
print(x)    
            
print('\n', x.T)  

# SciPy

Numpy provides a high-performance multidimensional array and basic tools to compute with and manipulate these arrays. SciPy builds on this, and provides a large number of functions that operate on numpy arrays and are useful for different types of scientific and engineering applications.

### MATLAB files

The functions ***scipy.io.loadmat*** and ***scipy.io.savemat*** allow you to read and write MATLAB files.
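
A minimal round-trip sketch, assuming SciPy is installed. The file name `example.mat` and the variable name `'m'` are placeholders chosen for this example, and the file is written to a temporary directory.

```python
import os
import tempfile

import numpy as np
import scipy.io

m = np.array([[1.0, 2.0], [3.0, 4.0]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.mat')

    # savemat takes a dict mapping MATLAB variable names to arrays.
    scipy.io.savemat(path, {'m': m})

    # loadmat returns a dict; its keys include the saved variable names.
    loaded = scipy.io.loadmat(path)

print(loaded['m'])
```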

# Matplotlib
Matplotlib is a plotting library. In this section we give a brief introduction to the matplotlib.pyplot module, which provides a plotting system similar to that of MATLAB.

### Plotting
The most important function in matplotlib is plot, which allows you to plot 2D data. Here is a simple example:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Compute the x and y coordinates for points on a sine curve
x = np.arange(0, 3 * np.pi, 0.1)
y = np.sin(x)

# Plot the points using matplotlib
plt.plot(x, y)
plt.show()

With just a little bit of extra work we can easily plot multiple lines at once, and add a title, legend, and axis labels:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Plot the points using matplotlib
plt.plot(x, y_sin)
plt.plot(x, y_cos)
plt.xlabel('x axis label')
plt.ylabel('y axis label')
plt.title('Sine and Cosine')
plt.legend(['Sine', 'Cosine'])
plt.show()

### Subplots
You can plot different things in the same figure using the subplot function. Here is an example:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Set up a subplot grid that has height 2 and width 1,
# and set the first such subplot as active.
plt.subplot(2, 1, 1)

# Make the first plot
plt.plot(x, y_sin)
plt.title('Sine')

# Set the second subplot as active, and make the second plot.
plt.subplot(2, 1, 2)
plt.plot(x, y_cos)
plt.title('Cosine')

# Show the figure.
plt.show()

# Pandas

A general introduction to pandas:

In [None]:
# Import all libraries needed for the tutorial

# General syntax to import specific functions (or submodules) from a library:
## from (library) import (specific library function)
from numpy import random

# General syntax to import a library but no functions:
## import (library) as (give the library a nickname/alias)
import matplotlib.pyplot as plt
import pandas as pd  # this is how I usually import pandas
import sys  # only needed to determine Python version number
import matplotlib  # only needed to determine Matplotlib version number

# Enable inline plotting
%matplotlib inline



**Main data types: Series & DataFrame.**

## Series
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.)

In [None]:
pd.Series([1,2,3,4])

In [None]:
d = {'a' : 0., 'b' : 1., 'c' : 2.}
pd.Series(d)
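
Because a Series is labeled, values can be looked up by label as well as by position. A short sketch with the same dictionary (the name `s` is added for illustration):

```python
import pandas as pd

s = pd.Series({'a': 0., 'b': 1., 'c': 2.})

print(s['b'])          # lookup by label
print(s.iloc[0])       # lookup by position
print(list(s.index))   # ['a', 'b', 'c']
```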

## Dataframe
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.

- Create Data - We begin by creating our own data set for analysis.

- Get Data - We will learn how to read in the text file. The data consist of baby names and the number of babies born with each name in the year 1880.

- Prepare Data - Here we will simply take a look at the data and make sure it is clean. By clean I mean we will take a look inside the contents of the text file and look for any anomalies. These can include missing data, inconsistencies in the data, or any other data that seems out of place. If any are found we will then have to make decisions on what to do with these records.

- Analyze Data - We will simply find the most popular name in a specific year.

- Present Data - Through tabular data and a graph, clearly show the end user what is the most popular name in a specific year.

# Create Data  

The data set will consist of 5 baby names and the number of births recorded for that year (1880).

In [None]:
# The initial set of baby names and birth counts
names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]

To merge these two lists together we will use the ***zip*** function.

In [None]:
BabyDataSet = list(zip(names,births))
BabyDataSet

We are basically done creating the data set. We now will use the ***pandas*** library to export this data set into a csv file. 

***df*** will be a ***DataFrame*** object. You can think of this object as holding the contents of BabyDataSet in a format similar to a SQL table or an Excel spreadsheet. Let's take a look below at the contents of ***df***.

In [None]:
df = pd.DataFrame(data = BabyDataSet, columns=['Names', 'Births'])
df

Export the dataframe to a ***csv*** file. We can name the file ***births1880.csv***. The function ***to_csv*** will be used to export the file. The file will be saved in the same location of the notebook unless specified otherwise.

The only parameters we will use are ***index*** and ***header***. Setting these parameters to False will prevent the index and header names from being exported.

In [None]:
df.to_csv('births1880.csv',index=False,header=False)

In [None]:
# Change the values of these parameters to get a better understanding of their use



## Get Data

To pull in the csv file, we will use the pandas function *read_csv*. Let us take a look at this function and what inputs it takes.

Even though this function has many parameters, we will simply pass it the location of the text file.  

Location = ' '

***Note:*** Depending on where you save your notebooks, you may need to modify the location above.  

In [None]:
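# Fill in the path to births1880.csv between the quotes
# (the file name alone is enough if it sits next to the notebook)
Location = r' '
df = pd.read_csv(Location)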

Notice the ***r*** before the string. Since backslashes are escape characters in Python strings, prefixing the string with ***r*** makes it a raw string, in which backslashes (common in Windows paths) are kept literally.  
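
A quick sketch of the difference, using a made-up Windows-style path (the folder `C:\temp` is hypothetical):

```python
# In a normal string, \t and \b are read as escape sequences
# (tab and backspace), so the path is silently corrupted.
normal = 'C:\temp\births1880.csv'

# With the r prefix, every backslash is kept as typed.
raw = r'C:\temp\births1880.csv'

print(repr(normal))
print(repr(raw))
print(len(raw) - len(normal))   # 2
```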

In [None]:
df

This brings us to the first problem of the exercise. The ***read_csv*** function treated the first record in the csv file as the header names. This is obviously not correct since the text file did not provide us with header names.  

To correct this we will pass the ***header*** parameter to the *read_csv* function and set it to ***None*** (means null in python).

In [None]:
df = pd.read_csv(Location, header=None)
df

If we wanted to give the columns specific names, we would have to pass another parameter called ***names***. We can also omit the *header* parameter.

In [None]:
df = pd.read_csv(Location, names=['Names','Births'])
df

## Prepare Data

The data we have consists of baby names and the number of births in the year 1880. We already know that we have 5 records and none of the records are missing (non-null values).  

The ***Names*** column at this point is of no concern since it most likely is just composed of alphanumeric strings (baby names). There is a chance of bad data in this column, but we will not worry about that at this point of the analysis. The ***Births*** column should contain only integers representing the number of babies born in a specific year with a specific name. We can check whether all the data in it is of an integer data type; it would not make sense for this column to have a float data type. I would not worry about any possible outliers at this point of the analysis.  

Realize that aside from the check we did on the "Names" column, briefly looking at the data inside the dataframe should be as far as we need to go at this stage of the game. As we continue in the data analysis life cycle we will have plenty of opportunities to find any issues with the data set.

In [None]:
# Check data type of the columns
df.dtypes

In [None]:
# Check data type of Births column
df.Births.dtype

As you can see the *Births* column is of type ***int64***, thus no floats (decimal numbers) or alpha numeric characters will be present in this column.

## Analyze Data

To find the most popular name, that is, the baby name with the highest number of births, we can do one of the following.  

* Sort the dataframe and select the top row
* Use the ***max()*** method to find the maximum value

In [None]:
# Method 1:
Sorted = df.sort_values(['Births'], ascending=False)
Sorted.head(1)

In [None]:
# Method 2:
df['Births'].max()
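
Note that `max()` returns only the largest value, not the name that goes with it. One way to recover the name is a boolean mask on the `Births` column, another is `idxmax`. A sketch on the same five-row data, rebuilt here so the block runs on its own (the names `top` and `top_name` are illustrative):

```python
import pandas as pd

names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]
df = pd.DataFrame({'Names': names, 'Births': births})

# Rows whose Births value equals the column maximum.
top = df[df['Births'] == df['Births'].max()]
print(top)

# idxmax returns the index label of the first maximum.
top_name = df.loc[df['Births'].idxmax(), 'Names']
print(top_name)   # Mel
```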

### A more complex example

# Create Data  

The data set will consist of 1,000 baby names and the number of births recorded for that year (1880). We will also add plenty of duplicates so you will see the same baby name more than once. You can think of the multiple entries per name simply being different hospitals around the country reporting the number of births per baby name. So if two hospitals reported the baby name "Bob", the data will have two values for the name Bob. We will start by creating the random set of baby names. 

In [None]:
# The initial set of baby names
names = ['Bob','Jessica','Mary','John','Mel']

To make a random list of 1,000 baby names using the five above we will do the following:  

* Generate a random number between 0 and 4  

To do this we will be using the functions ***seed***, ***randint***, ***len***, ***range***, and ***zip***.   

**seed(500)** - Create seed

**randint(low=0,high=len(names))** - Generate a random integer from zero up to, but not including, the length of the list "names".    

**names[n]** - Select the name whose index is equal to n.  

**for i in range(n)** - Loop n times, with i taking the values 0, 1, 2, ..., n-1.  

**random_names** = Select a random name from the name list and do this n times.  

In [None]:
random.seed(500)
random_names = [names[random.randint(low=0,high=len(names))] for i in range(1000)]

# Print first 10 records
random_names[:10]
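
A small sketch of two details of the calls above: `randint` excludes its `high` bound, and reseeding with the same value reproduces the same draws. It uses the same `numpy.random` functions as the cell above; the names `first` and `second` are illustrative.

```python
from numpy import random

names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']

# high is exclusive, so every draw is a valid index 0..len(names)-1.
random.seed(500)
first = [random.randint(low=0, high=len(names)) for i in range(1000)]
print(min(first), max(first))

# Reseeding with the same value gives the same sequence again.
random.seed(500)
second = [random.randint(low=0, high=len(names)) for i in range(1000)]
print(first == second)   # True
```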

Generate random numbers between 0 and 999 (the upper bound of 1000 is excluded)    

In [None]:
# The number of births per name for the year 1880
births = [random.randint(low=0,high=1000) for i in range(1000)]
births[:10]

Merge the ***names*** and the ***births*** data set using the ***zip*** function.

In [None]:
BabyDataSet = list(zip(random_names,births))
BabyDataSet[:10]

In [None]:
df = pd.DataFrame(data = BabyDataSet, columns=['Names', 'Births'])
df[:10]

* Export the dataframe to a ***text*** file. We can name the file ***births1880.txt***. The function ***to_csv*** will be used to export. The file will be saved in the same location of the notebook unless specified otherwise.

In [None]:
df.to_csv('births1880.txt',index=False,header=False)

## Get Data

To pull in the text file, we will use the pandas function *read_csv*. Let us take a look at this function and what inputs it takes.

In [None]:
# Fill in the path to births1880.txt between the quotes
Location = ' '
df = pd.read_csv(Location)

In [None]:
df.info()

Info says:  

* There are ***999*** records in the data set  
* There is a column named ***Mary*** with 999 values  
* There is a column named ***968*** with 999 values  
* Out of the ***two*** columns, one is ***numeric***, the other is ***non numeric***  

To actually see the contents of the dataframe we can use the ***head()*** function which by default will return the first five records. You can also pass in a number n to return the top n records of the dataframe. 

In [None]:
df.head()

This brings us to our first problem of the exercise. The ***read_csv*** function treated the first record in the text file as the header names. This is obviously not correct since the text file did not provide us with header names.  

To correct this we will pass the ***header*** parameter to the *read_csv* function and set it to ***None*** (means null in python).

In [None]:
df = pd.read_csv(Location, header=None)
df.info()

Info now says:  
* There are ***1000*** records in the data set  
* There is a column named ***0*** with 1000 values  
* There is a column named ***1*** with 1000 values  
* Out of the ***two*** columns, one is ***numeric***, the other is ***non numeric***  

Now let's take a look at the last five records of the dataframe.

In [None]:
df.tail()

If we wanted to give the columns specific names, we would have to pass another parameter called ***names***. We can also omit the *header* parameter.

In [None]:
df = pd.read_csv(Location, names=['Names','Births'])
df.head(5)

You can think of the numbers [0,1,2,3,4,...] as the row numbers in an Excel file. In pandas these are part of the ***index*** of the dataframe. You can think of the index as the primary key of a sql table with the exception that an index is allowed to have duplicates.  

***[Names, Births]*** can be thought of as column headers similar to the ones found in an Excel spreadsheet or SQL database.

## Prepare Data

The data we have consists of baby names and the number of births in the year 1880. We already know that we have 1,000 records and none of the records are missing (non-null values). We can verify the "Names" column still only has five unique names.  

We can use the ***unique*** method of the "Names" column to find all of its distinct values.

In [None]:
# Method 1:
df['Names'].unique()

In [None]:
# If you actually want to print the unique values:
for x in df['Names'].unique():
    print(x)

In [None]:
# Method 2:
print(df['Names'].describe())

Since we have multiple values per baby name, we need to aggregate this data so we only have a baby name appear once. This means the 1,000 rows will need to become 5. We can accomplish this by using the ***groupby*** function. 

In [None]:
# Create a groupby object
name = df.groupby('Names')

# Apply the sum function to the groupby object
df = name.sum()
df
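
To see what the aggregation does on a case small enough to check by hand, here is a sketch with made-up rows (the frame `small` is not part of the workshop data):

```python
import pandas as pd

small = pd.DataFrame({
    'Names': ['Bob', 'Mary', 'Bob', 'Mary', 'Mel'],
    'Births': [10, 20, 30, 40, 50],
})

# One row per name; Births holds the total across that name's rows.
totals = small.groupby('Names').sum()
print(totals)

# The group keys become the index of the result.
print(totals.loc['Bob', 'Births'])   # 40
```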

## Analyze Data

To find the most popular name, that is, the baby name with the highest number of births, we can do one of the following.  

* Sort the dataframe and select the top row
* Use the ***max()*** method to find the maximum value

In [None]:
# Method 1:
Sorted = df.sort_values(['Births'], ascending=False)
Sorted.head(1)

In [None]:
# Method 2:
df['Births'].max()

## Present Data

Here we can plot the ***Births*** column and label the graph to show the end user the highest point on the graph. In conjunction with the table, the end user has a clear picture that **Bob** is the most popular baby name in the data set. 

In [None]:
# Create graph
df['Births'].plot.bar()

print("The most popular name")
df.sort_values(by='Births', ascending=False)