## Sources
* This content was taken from the CS109 Data Science course at the Harvard School of Engineering and Applied Sciences:
<a href=http://cs109.github.io/2014/>CS109 Data Science - 2014</a>

* Minor adaptations were made for our course.

In [None]:
# special IPython command to prepare the notebook for matplotlib
%matplotlib inline 

import pandas as pd
import matplotlib.pyplot as plt

# pd.options.display.mpl_style = 'default'
plt.style.use('ggplot')

## This notebook will discuss the following:
* Reading in a CSV file into a pandas DataFrame
* Using histograms, scatterplots and boxplots as exploratory data analysis
* Summary statistics
* Functions to access a pandas DataFrame
* Defining your own functions and using loops

<a href=https://raw.githubusercontent.com/cs109/2014/master/labs/Lab2_Notes.ipynb download=HW1.ipynb> Download the original notebook from Github </a>

#### Important: Tips for good Python Coding Practices 

1. Always comment the code, avoiding inline comments
2. Define functions to do commands that you have to do repeatedly; use docstrings when defining a function.
3. When simply iterating a procedure N times, use a lazy iterator rather than a list built in memory: in Python 3, `range` is already lazy; in Python 2, use `xrange` instead of `range`.
4. Be aware of what parts of the code take the most time to run and plan accordingly (especially web requests, they tend to take a while)
5. Don't write long lines (we shouldn't have to scroll to see the whole line of code)
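
As a minimal sketch of tip 3, assuming Python 3: `range` produces its values on demand, so it stays small no matter how many values it describes, while a list stores every element. The size `n` below is an arbitrary choice for illustration.

```python
import sys

n = 100_000  # arbitrary size chosen for illustration

lazy = range(n)         # lazy sequence: values are produced on demand
eager = list(range(n))  # full list: all n integers are stored in memory

# The range object is much smaller than the list it describes.
print(sys.getsizeof(lazy) < sys.getsizeof(eager))

# Both are iterated in exactly the same way.
total = 0
for i in lazy:
    total += i
print(total == sum(eager))
```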


## Diamonds Data

This data set contains the prices and other attributes of almost 54,000 diamonds. This dataset is available on Github in the [2014_data repository](https://github.com/cs109/2014_data) and is called `diamonds.csv`.  


## Reading in the diamonds data (CSV file) from the web

This is a `.csv` file, so we will use the function `read_csv()`, which reads a CSV file into a pandas DataFrame. 

In [None]:
url = 'https://raw.githubusercontent.com/cs109/2014_data/master/diamonds.csv'
diamonds = pd.read_csv(url, sep = ',', index_col=0)
diamonds.head()
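
To see what `index_col=0` does without a network connection, the sketch below reads a tiny in-memory CSV twice. The three rows in `csv_text` are made up for illustration and only mimic the layout of `diamonds.csv`, whose first column is unnamed.

```python
import io

import pandas as pd

# Made-up three-row CSV; the unnamed first column holds the row labels.
csv_text = """,carat,cut,price
1,0.23,Ideal,326
2,0.21,Premium,326
3,0.23,Good,327
"""

# index_col=0 uses the first column as the index instead of as data.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without index_col, the first column is kept as an ordinary data column.
without_index = pd.read_csv(io.StringIO(csv_text))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

With `index_col=0` the row labels 1, 2, 3 become the index; without it, pandas generates a default 0-based index and keeps the labels as an extra column.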

Here is a table containing a description of all the column names. 

Column name | Description 
--- | --- 
carat | weight of the diamond (0.2–5.01)
cut | quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color | diamond color, from J (worst) to D (best)
clarity | a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
depth | total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
table | width of top of diamond relative to widest point (43–95)
price | price in US dollars (\$326–\$18,823)
x | length in mm (0–10.74)
y | width in mm (0–58.9)
z | depth in mm (0–31.8)
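
The `depth` formula in the table can be checked with a short sketch. The dimensions below are made-up values, not rows from the data set, and the factor of 100 is an assumption based on the word "percentage" and the stated range (43–79).

```python
import pandas as pd

# Made-up dimensions in mm, for illustration only.
dims = pd.DataFrame({"x": [4.0, 5.0], "y": [4.0, 5.2], "z": [2.4, 3.1]})

# depth percentage = 100 * z / mean(x, y) = 100 * 2 * z / (x + y)
dims["depth_pct"] = 100 * 2 * dims["z"] / (dims["x"] + dims["y"])

print(dims.round(2))
```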

## Exploratory Data Analysis (EDA)

The variables `carat` and `price` are both continuous variables, while `color` and `clarity` are categorical (discrete) variables. First, let's look at some summary statistics of the diamonds data set. 

In [None]:
diamonds.describe()

Let's look at the distribution of carats and price using a histogram.

In [None]:
diamonds['price'].hist(bins=50, color = 'black')
plt.title('Distribution of Price')
plt.xlabel('Price')

In [None]:
# Try changing the number of bins from 20 to 500. What do you notice? 
diamonds['carat'].hist(bins=20, color = 'black', figsize=(6, 4))
plt.title('Distribution of weights in carats')
plt.xlim(0, 3)
plt.xlabel('Weight in Carats')

Plot the density of the price of the diamonds

In [None]:
diamonds['price'].plot(kind='kde', color = 'black')
plt.title('Distribution of Price')

Now, let's look at the relationship between the price of a diamond and its weight in carats. Try changing `alpha` (which ranges from 0 to 1) to control overplotting: lower values make the points more transparent, so dense regions stand out. 

In [None]:
diamonds.plot(x='carat', y='price', kind = 'scatter', color = 'black', alpha = 1)

We can also create a scatter plot using matplotlib.pyplot instead of pandas directly.

In [None]:
plt.scatter(diamonds['carat'], diamonds['price'], color = 'black', alpha = 0.05)
plt.xlabel('Carat')
plt.ylabel('Price')

Let's look at the scatter plots of `price` and `carat` but grouped by color.  

In [None]:
diamonds.groupby('color').plot(x='carat', y='price', kind = 'scatter', color = 'black', alpha = 1)

What happens if you look at the scatter plots of `price` and `carat` grouped by clarity instead?  

In [None]:
# try here
diamonds.groupby('clarity').plot(x='carat', y='price', kind = 'scatter', color = 'black', alpha = 1)

We could also look at boxplots of the `price` grouped by `color`.  

In [None]:
diamonds.boxplot('price', by = 'color')
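
The numbers behind such a boxplot can also be computed directly. The following is a minimal sketch on a made-up `sample` frame standing in for `diamonds`; it reports the quartiles that define each box.

```python
import pandas as pd

# Small made-up sample standing in for the diamonds DataFrame.
sample = pd.DataFrame({
    "color": ["D", "D", "E", "E", "J", "J"],
    "price": [400, 600, 500, 700, 300, 900],
})

# Quartiles of price within each color group (the box edges and the median).
summary = sample.groupby("color")["price"].describe()[["25%", "50%", "75%"]]
print(summary)
```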

## More with pandas

Now that we have done some exploratory data analysis with histograms, scatter plots and boxplots, let's look more closely at how to work with the pandas DataFrame itself.  

#### More summary statistics

We just learned about `diamonds.describe()` above. What else can we do? 

In [None]:
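# numeric_only=True skips the text columns (cut, color, clarity)
diamonds.mean(numeric_only=True)

In [None]:
# correlation between the numeric columns only
diamonds.select_dtypes(include='number').corr()

In [None]:
diamonds.var(numeric_only=True) # variance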

In [None]:
# sort the rows by price; returns a new DataFrame and leaves diamonds unchanged
diamonds.sort_values('price', ascending = True).head()
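
Sorting by a column is done with `sort_values`, which also accepts several columns at once. Below is a minimal sketch on a made-up `sample` frame, sorting by price and breaking ties by carat in descending order.

```python
import pandas as pd

# Made-up values for illustration.
sample = pd.DataFrame({
    "carat": [0.30, 0.23, 0.31, 0.21],
    "price": [400, 326, 400, 326],
})

# Sort by price (ascending), breaking ties by carat (descending).
ordered = sample.sort_values(["price", "carat"], ascending=[True, False])
print(ordered)

# sort_values returns a new DataFrame; the original order is untouched.
print(sample.index.tolist())
```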

## Functions to access a pandas DataFrame

In [None]:
subtable = diamonds.iloc[0:2, 0:2]
print ("subtable")
print (subtable)
print ("")

column = diamonds['color']
print ("head of the color column")
print (column.head())
print ("")

row = diamonds.iloc[1:2] # the row at position 1 (the slice end is excluded)
print ("row")
print (row)
print ("")

rows = diamonds.iloc[:3] # the first three rows (positions 0, 1 and 2)
print ("rows")
print (rows)
print ("")

color = diamonds.loc[1,'color']
print ("color of diamond in row 1")
print (color)
print ("")

# max along column
print ("max price %g" % diamonds['price'].max()) 
print ("")

# axes
print ("axes")
print (diamonds.axes)
print ("")

row = diamonds.iloc[1] # the second row, returned as a Series
print ("row info")
print (row.name) # index label of the row
print (row.index) # column names
print ("")
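
The difference between label-based (`.loc`) and position-based (`.iloc`) access is easy to miss when the index starts at 1, as it does here. The sketch below uses a made-up `sample` frame to show both side by side.

```python
import pandas as pd

# Made-up frame whose index labels start at 1, as in the diamonds data.
sample = pd.DataFrame(
    {"color": ["E", "E", "I"], "price": [326, 326, 334]},
    index=[1, 2, 3],
)

# .loc selects by index label: the row labelled 1 is the first row.
by_label = sample.loc[1, "color"]

# .iloc selects by integer position: position 1 is the second row.
by_position = sample.iloc[1]

# Label slices include the end point; position slices exclude it.
label_slice = sample.loc[1:2]      # rows labelled 1 and 2
position_slice = sample.iloc[1:2]  # only the row at position 1

print(by_label, by_position.name, len(label_slice), len(position_slice))
```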

## Defining your own functions

New functions are defined using `def`, one of Python's reserved keywords.  

In [None]:
def squared(x):
    """ Return the square of a  
        value """
    return x ** 2

squared(4)

The first line of the function (the header) must start with the keyword `def`, followed by the name of the function (which can contain underscores), parentheses (with any arguments inside them) and a colon.  When calling the function, arguments are matched by position unless they are passed by keyword (`name=value`), in which case their order does not matter. 

The rest of the function (the body) must be indented consistently; four spaces is the convention.  If you define a function in the interactive mode, the interpreter will print ellipses (...) to let you know the function isn't complete. To complete the function, enter an empty line (not necessary in a script).  

To return a value from a function, use `return`. The function will immediately terminate and not run any code written past this point.
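
A small sketch of these rules, using a hypothetical helper named `power` (not part of the original lab): it shows a default argument, positional and keyword calls, and a `return` that exits the function early.

```python
def power(base, exponent=2):
    """Return base raised to exponent (hypothetical helper for illustration)."""
    if exponent == 0:
        # return exits the function here; the line below is skipped.
        return 1
    return base ** exponent


print(power(3))                   # positional argument, default exponent
print(power(2, 3))                # both arguments by position
print(power(exponent=3, base=2))  # keyword arguments in a different order
```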

#### The docstring
When defining new functions, you can add a `docstring` (i.e. the documentation of the function) at the beginning of the body to describe what the function does. The docstring is usually a triple-quoted string, which can span multiple lines.  We highly recommend documenting the functions you define, as good Python coding practice. 
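
As a brief sketch, the docstring of a function can be read back at run time. The function `cubed` below is a hypothetical example, not part of the original lab.

```python
def cubed(x):
    """Return the cube of a value."""
    return x ** 3


# The docstring is stored on the function object itself.
print(cubed.__doc__)

# help() formats the same text for interactive use.
help(cubed)
```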

#### Lambda functions
Lambda functions are small anonymous functions whose body is a single expression. When you define a function with the `lambda` keyword, you do not write `return`; the value of the expression is returned automatically.  For example, we can re-write the `squared()` function above using the following syntax:

In [None]:
f = lambda x: x**2
f(4)
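
Lambdas are most useful as short throwaway arguments to other functions. The sketch below, with made-up `(carat, price)` pairs, passes one to `sorted` as a key and another to `map`.

```python
# Made-up (carat, price) pairs for illustration.
stones = [(0.31, 335), (0.23, 326), (0.29, 334)]

# A lambda as a key function: sort by price, the second field of each pair.
by_price = sorted(stones, key=lambda stone: stone[1])

# map applies the lambda to every element of the list.
squares = list(map(lambda x: x ** 2, [1, 2, 3]))

print(by_price)
print(squares)
```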

## For loops and while loops

#### For loops
Defining a `for` loop is similar to defining a new function. The header ends with a colon and the body is indented (four spaces by convention). The function `range(n)` takes an integer n and produces the sequence of integers from 0 to n - 1.  `for` loops are not just for counters; they can iterate through many types of objects, such as strings, lists and dictionaries. 

In [None]:
for i in range(4):
    print ('Hello world!')
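
As a short sketch of iterating over other kinds of objects, the loops below walk through a list (with `enumerate` to get the position as well) and a dictionary. The names and prices are made up for illustration.

```python
cuts = ["Fair", "Good", "Ideal"]

# enumerate yields (position, element) pairs while looping over a list.
labels = []
for position, cut in enumerate(cuts):
    labels.append(str(position) + ": " + cut)
print(labels)

# Made-up prices; .items() yields (key, value) pairs of a dictionary.
prices = {"Fair": 337, "Good": 327}
total = 0
for cut, price in prices.items():
    print(cut, price)
    total += price
print(total)
```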

To traverse through all characters in a given string, you can use `for` or `while` loops. Here we create the names of the duck statues in the Public Gardens in downtown Boston: Jack, Kack, Lack, Mack, Nack, Oack, Pack, Qack. 

In [None]:
prefixes = 'JKLMNOPQ'
suffix = 'ack'
for letter in prefixes:
    print (letter + suffix)
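
The same traversal can be written with a `while` loop, at the cost of managing the index by hand. This is a sketch for comparison; the `for` version above is the more idiomatic choice.

```python
prefixes = 'JKLMNOPQ'
suffix = 'ack'

names = []
index = 0
while index < len(prefixes):
    names.append(prefixes[index] + suffix)
    index += 1  # without this increment the loop would never end

print(names)
```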

#### while loops
Defining a `while` loop is again similar to defining a `for` loop or a new function. The header ends with a colon and the body is indented (four spaces by convention). The body runs repeatedly for as long as the condition in the header is true. 

In [None]:
def countdown(n):
    while n > 0:
        print (n)
        n = n-1
    print ('Blastoff!')

countdown(3)

#### List comprehensions
Another powerful feature of Python is the **list comprehension**, which builds a new list by applying an expression to each element of an existing one.  Here, we take each element of the list `a` (temporarily assigning it to the name `i`) and square it. This creates a new list and does not modify `a`.  In the second comprehension, we add a condition so that an element is squared and included only if it is not equal to 10.

In [None]:
a = [5, 10, 15, 20]
b = [i**2 for i in a]
c = [i**2 for i in a if i != 10]

print ("a: ", a)
print ("b: ", b)
print ("c: ", c)
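
For comparison, here is a sketch of the explicit loop that the filtered comprehension replaces, followed by the analogous dictionary comprehension.

```python
a = [5, 10, 15, 20]

# Explicit loop equivalent to [i**2 for i in a if i != 10].
c_loop = []
for i in a:
    if i != 10:
        c_loop.append(i ** 2)

c_comp = [i ** 2 for i in a if i != 10]
print(c_loop == c_comp)

# The same syntax with braces and key: value builds a dictionary.
squares_by_value = {i: i ** 2 for i in a}
print(squares_by_value)
```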