## Lesson 3: Intro to Statistics (Part 2), Python Wrap-Up, and Tables

On the worksheet, we learned a bit more about standard deviation and discovered a couple of ways to compare individual points in our dataset to one another. Now, let's discover how we can use Python to process data. Along the way, we'll apply what we learned on the worksheet with these new Python tools.

## Functions

You may be familiar with what *functions* are from your math class, where you learnt that a function takes in input value(s) and reports an output value. Functions in programming are the same! They are given an *input*, or many inputs, do something (or multiple things) to these inputs, and eventually yield an *output*, or result. The function contains instructions used to create the output from its input. 

It’s like a cow: it eats grass (the input), and its body turns that grass into milk (the output), which a dairy farmer can then collect.

*Why would we need functions?*

Computers are great at doing tasks over and over again because we *define* functions for them to *call*. They allow for generalization of code, so that we don't have to repeat ourselves over and over again.

To define a function in Python, you write a function header. The header begins with the keyword *def*, followed by the name of the function, then any arguments (a.k.a. input variables or parameters) in parentheses, and it ends with a colon (:). Just like with *if* statements, we need to indent whatever we want to put inside the function.
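As a small illustration of these parts, here is a sketch of a function definition. The function name `area_of_rectangle` and its arguments are made up for this example and are not part of the lesson's exercises.

```python
# "def" starts the header, then comes the function's name,
# then the arguments in parentheses, then a colon.
def area_of_rectangle(length, width):
    # Everything indented under the header is inside the function.
    area = length * width
    return area

# Calling the function: 3 and 4 are passed in as length and width.
result = area_of_rectangle(3, 4)
print(result)  # prints 12
```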

Suppose you're working for a textile company and want to show them how much cloth they need to cut for a square rug of fixed dimensions. Rather than calculating the perimeter of the square by hand each time, wouldn't it be nicer to use just one line of code? All you need to do is change one number. Try passing different numbers into the perimeter_of_square function and see what happens.

In [3]:
def perimeter_of_square(length):  # perimeter_of_square is the name of the function and length is its only argument
    return length * 4  # return tells the interpreter to output what follows
perimeter_of_square(45)

180

**Return vs Print:**
In the function above, we used a *return* statement to get an output value. If we had replaced *return* with *print*, the notebook would have displayed the same number. However, there is a big functional difference between the two. *return* ends the function and hands a value back to the code that called it, so that you can store it and use it later in your program. *print* only displays something on the screen (it is Python's way of *saying* something) and then lets the function keep running; it doesn't hand a value back. To understand this difference, take a look at the examples below.

In [7]:
def add_2_with_print(num):
    print("Hello!")
    return num + 2

x = add_2_with_print(3)
x #Here, the function returns 5, and the variable x is set equal to the function's return value.

Hello!


5

In [8]:
def add_2_with_return(num):
    return "Hello!"
    return num + 2
add_2_with_return(3)

'Hello!'

As you can see above, in add_2_with_print, the print statement is executed and the function continues running, returning 3 + 2 = 5. However, in add_2_with_return, the function stops running at the first return statement, i.e. at "Hello!", and the interpreter never reaches the second return statement.
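To see what "printing doesn't report a value" means in practice, here is a minimal sketch with two made-up functions (`double_with_print` and `double_with_return` are hypothetical names for this example). A function that only prints hands back Python's special value None, so there is nothing useful to store.

```python
def double_with_print(num):
    print(num * 2)   # shows the value on screen, but hands nothing back

def double_with_return(num):
    return num * 2   # hands the value back to whoever called the function

a = double_with_print(5)    # displays 10 while running
b = double_with_return(5)   # displays nothing

print(a)  # prints None, because no value was returned
print(b)  # prints 10
```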

Let's write a function that'll allow you to tell someone your name and age! Replace the blanks with your code!

In [None]:
def name_and_age(name, age):
    ____("My name is " + name) #would you put return or print here?
    print("I am " + ___ + " years old")

name_and_age("your name here", "your age here")

Let's write a function that adds two numbers together if they are both less than 20. If this is not the case, we multiply them with each other.

In [None]:
def add_or_multiply(num1, num2):
    if num1 < 20 and num2 < 20:
        return ___________
    else:
        return ___________________

In the function above, you used an *if-else statement*. In such a statement, if the condition after *if* is True, the interpreter executes the code indented under the *if*. If not, it executes the code indented under the *else*.

You have also now seen *and*, and you will see *or* soon too. Both are boolean operators that combine two conditions. *and* gives True only if both of its conditions are True, while *or* gives True if at least one of its conditions is True.
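Here is a short sketch of how *and* and *or* behave, using made-up numbers. One point worth stating: each side of *and* or *or* should be a complete comparison of its own, because Python does not spread a comparison across both sides.

```python
small = 5
big = 50

# and: True only when both comparisons are True
print(small < 20 and big < 20)   # prints False (big is not less than 20)

# or: True when at least one comparison is True
print(small < 20 or big < 20)    # prints True (small is less than 20)

# Pitfall: here only "small < 20" is a comparison. "big" on its own
# counts as True for any non-zero number, so this does NOT check big.
print(big and small < 20)        # prints True, even though big is 50
```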

Now let's write a function that calculates the sum of digits in a two-digit number. How would you isolate the tens digit and units digit? (*Hint: use floor division // and modulo %*) <-- Do we teach modulo??
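In case the two operators in the hint are new to you, here is a quick sketch of what they do, using numbers unrelated to the exercise. Floor division (//) gives the whole-number part of a division, and modulo (%) gives the remainder.

```python
# 17 sweets shared equally among 5 friends
each_gets = 17 // 5   # whole sweets per friend
left_over = 17 % 5    # sweets remaining after sharing

print(each_gets)  # prints 3
print(left_over)  # prints 2

# The two results always fit back together like this:
print(each_gets * 5 + left_over)  # prints 17
```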

In [None]:
def sumofdigits(n):
    units_digit = #your code here
    tens_digit = #your code here
    return units_digit + tens_digit

sumofdigits(49) #should return 13


Let's write the code for the login page of a website. Set up a username and password, and then fill in the code for the login function so that only someone who knows the username and password can log in!

In [None]:
username = 'Your username here'
password = 'Your password here'

def login(name, passw):
    if #your code here
        print("Login successful!")
    else:
        print("Sorry! Invalid username or password.")
        
login(username, password)


Now write your own function that outputs True if a number is a multiple of 3, and False if it is not. Remember to give your function a name, parameter(s) and a return statement!



^^Better example?

## Packages

A package is simply a set of useful programs that we use so that we don’t have to write the same code all the time. Think of them as collections of functions written by other people to make life easier for programmers like you! They allow you to do a lot more with your programs, all while making them shorter and more readable.

*Volunteers -> For example, imagine you write a program that generates prime numbers. You could turn it into a package called “generatePrime” and publish it on the web. Then your friend, who has to write a cybersecurity program for a bank, can simply type “import generatePrime” in his program and call your function to generate a large prime number for him to work with. Another friend of yours wishes to write a program that assigns one prime number to each of her classmates as part of a school project. She, too, can import your package and use your function to help with her project. It is important to note here that you did not change your code at all, and yet two different people were greatly aided by it in two completely different ways.*


**What Packages Will We Use?**

**Datascience**: This package allows us to look at data in tables, and change these tables to show us what we want.

**Numpy**: This package has functions that help us do calculations and process lots of data.



When you call a function from a package, you put the package's name and the "." character before the function's name. If you're going to be using a package a lot, retyping its name can get tiring! This is why we use the "as" keyword when we import the package: it gives the package a shorter nickname. After that line of code, the package can be referred to by whatever name you gave it. Try importing numpy as np below.
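Before you try it with numpy, here is a sketch of both styles using Python's built-in math package, chosen here only because it is always available. The nickname `m` is an arbitrary choice for this example.

```python
import math            # plain import: use the full package name
print(math.sqrt(16))   # prints 4.0

import math as m       # "as" gives the package a shorter nickname
print(m.sqrt(16))      # prints 4.0
```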


Here's another way to import a package (the * just means "everything"). With this form, you can use the package's functions without typing the package's name first.


In [1]:
from datascience import *

## Arrays -> TODO: something about indexing into arrays?


While there are many different types of data, we will primarily be working with arrays in this class. Arrays can contain strings or other types of values, but a single array can only contain a single kind of data. For example, the array below only contains strings.


In [None]:
import numpy as np
india = np.array(["mumbai", "delhi", "kolkata", "chennai"])
india
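Two things are worth trying with an array like this one. The sketch below uses a made-up array called `cities` (an assumption for this example) to show how to pick out single elements by position, counting from zero, and what numpy does when you mix kinds of data: it converts everything to one kind.

```python
import numpy as np

cities = np.array(["pune", "jaipur", "kochi"])

# Positions are counted from zero, so [0] is the first element.
print(cities[0])    # prints pune
print(cities[2])    # prints kochi
print(len(cities))  # prints 3

# Mixing whole numbers and a decimal: numpy stores them all as decimals.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)  # prints float64, the kind of data stored in the array
```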

Arrays can be used in arithmetic expressions to compute over their contents. When an array is combined with a single number, that number is combined with each element of the array. For example, below we have an array which contains the average temperatures of four cities in India. Suppose we find that there has been a 2-degree rise in the average temperature of each city. To compute the new temperature of each city, we can simply add 2 to the array.

In [5]:
average_temperature = np.array([34, 36, 37, 32])
average_temperature + 2

array([36, 38, 39, 34])
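Other arithmetic works the same way. As a sketch, assuming the temperatures above are in degrees Celsius (the lesson does not say), we can convert every city's temperature to Fahrenheit in one line.

```python
import numpy as np

average_temperature = np.array([34, 36, 37, 32])

# Each element is multiplied by 9, divided by 5, and then has 32 added.
fahrenheit = average_temperature * 9 / 5 + 32
print(fahrenheit)

# The original array is left unchanged by the arithmetic.
print(average_temperature)
```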

Suppose we have an array that contains the marks of 10 students on their math test (out of 20 points). How would we compute their percentages? Hint: divide by a number.

In [None]:
students_marks = np.array([18, 17, 19, 15, 18, 20, 12, 17, 11, 14])
students_percentages = #insert answer here
students_percentages

The numpy package provides us with convenient functions for creating and manipulating arrays. Some examples include np.prod, which multiplies all the elements together, and np.sum, which adds all the elements together. Both of these functions take in an array as their argument and return a single value. We also have functions that take in arrays as arguments and return arrays of values. Examples include np.diff, which computes the difference between adjacent elements, and np.round, which rounds each element to the nearest whole number.

Let's use functions to calculate the mean of an array that counts how many flowers of each kind are growing in a garden.


In [6]:
flowers = np.array([12, 15, 19, 32, 20])
total = np.sum(flowers)
length = len(flowers) #len is a function that returns the length of its input
mean = total/length
mean

19.600000000000001
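The other functions named earlier can be tried the same way. A minimal sketch on the same flower counts follows. The `heights` array is made up for this example, and np.mean (a numpy function that computes the mean directly) is included only as a cross-check on the calculation above.

```python
import numpy as np

flowers = np.array([12, 15, 19, 32, 20])

print(np.sum(flowers))    # prints 98
print(np.diff(flowers))   # differences between neighbours: 3, 4, 13, -12
print(np.mean(flowers))   # the same mean as total / length

heights = np.array([1.2, 2.7, 3.9])
print(np.round(heights))  # each element rounded: 1, 3, 4
print(np.prod(np.array([1, 2, 3, 4])))  # prints 24
```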

In the worksheet, we learnt how to convert values into standard units. Using what you've learnt from there, let's convert the annual rainfall of 5 cities into standard units, with arrays.

In [None]:
annual_rainfall = np.array([695, 958, 760, 1118, 1063])
mean_rainfall = ________________
variance = __________________
standard_deviation = np.sqrt(variance)
rainfall_standard_units = ____________
rainfall_standard_units

## Tables

Think back to the first session. We did a lot of work with *tables*. Think of tables as an organized way of representing data. Tables have *rows* (read horizontally) and *columns* (read vertically).

Rows contain all the data about one item in the table, and columns represent one *attribute* of our data set. Here's an example.

In [38]:
t = Table().with_columns("Name", np.array(["Rohan", "Meena", "Priya", "Ahmed"]),
                     "Age", np.array([18, 16, 17, 15]),
                     "Favorite Activity", np.array(["Tabla", "Football", "Sitar", "Cricket"]),
                     "Favorite Fruit", np.array(["Mango", "Lychee", "Guava", "Mango"]))
t

Name,Age,Favorite Activity,Favorite Fruit
Rohan,18,Tabla,Mango
Meena,16,Football,Lychee
Priya,17,Sitar,Guava
Ahmed,15,Cricket,Mango


Try seeing if you can fill in this table with data about yourself and three other students. 

In [None]:
names = np.array(["_______", "_______", "__________", "_________"])
ages = np.array([_______, ______, ________, ________])
activities = np.array(["_______", "_______", "__________", "_________"])
fruits = np.array(["_______", "_______", "__________", "_________"])

your_data = Table().with_columns("Name", names,
                     "Age", ages,
                     "Favorite Activity", activities,
                     "Favorite Fruit", fruits)

In Python, we can write expressions to help us work with tables.

To access the data in a column, we can use the *column* method, as in the statements below. We have to refer to the table by name before calling *column*; otherwise Python won't know which table we're referring to!

In [39]:
t.column(0)

array(['Rohan', 'Meena', 'Priya', 'Ahmed'], 
      dtype='<U5')

In [40]:
t.column("Name")

array(['Rohan', 'Meena', 'Priya', 'Ahmed'], 
      dtype='<U5')

Note how we were able to pass in either the column's name or its position in the table (it's the first column, and we count from zero) to get an array of the names. Here, try accessing the names in your table.

Python can also show us how many entries are in our table with the num_rows attribute (note that it is written without parentheses). Here, we can see that our table contains data about four people.

In [50]:
t.num_rows

4

Sometimes, we need to *sort* our data by putting it into a meaningful order. Say we want to find the name of the oldest student.

In [41]:
t.sort("Age", descending=True).column("Name")[0]

'Rohan'

Write a line of code that sorts your table such that the youngest student is on top.

Suppose you're at a marketing company trying to see what people your age are interested in. Here's where the *select* method comes in handy: it keeps only the columns you ask for, which you can name by label or by position.

In [42]:
t.select("Favorite Fruit", 2)

Favorite Fruit,Favorite Activity
Mango,Tabla
Lychee,Football
Guava,Sitar
Mango,Cricket


Select one or two columns from your table here.

Let's add a column to our table.

In [46]:
t2 = t.with_column("Activity Type", np.array(["Music", "Sport", "Music", "Sport"]))
t2

Name,Age,Favorite Activity,Favorite Fruit,Activity Type
Rohan,18,Tabla,Mango,Music
Meena,16,Football,Lychee,Sport
Priya,17,Sitar,Guava,Music
Ahmed,15,Cricket,Mango,Sport


You can also look at only the rows that satisfy a certain condition with the *where* method. In this example, you're the school's coach and want to find players to pick for the sports teams, which you can do with the following line.

In [48]:
t2.where("Activity Type", are.equal_to("Sport"))

Name,Age,Favorite Activity,Favorite Fruit,Activity Type
Meena,16,Football,Lychee,Sport
Ahmed,15,Cricket,Mango,Sport


Find the people in your table who are older than 16. Hint: Use the "are.above" predicate.