One of the reasons that the Python language is extremely popular is that it makes writing programs easy. Because Python is a high-level language, we don't have to worry about things like allocating memory on our computer or choosing how certain operations are done by our computer's processor. In contrast, when we use low-level languages like C, we define exactly how memory will be managed and how the processor will execute our instructions. This means that coding in a low-level language takes longer; however, we have more ability to optimize our code to run faster.

We used lists of lists to represent data sets. While lists of lists are sufficient for working with small data sets, they aren't very good for working with larger data sets. The NumPy library solves this problem.

In [4]:
lolst = [[1,2],[0,1],[2,1],[1,1]]

sums = []
for i in lolst:
    add = i[0] + i[1]
    sums.append(add)
sums
    

[3, 1, 3, 2]

Python compiles our code into bytecode. In each iteration of our loop, the interpreter executes that bytecode, which asks our computer's processor to add the two numbers together.

In this simplified picture, our computer would take four processor cycles to process the four rows of our data, one for each addition.

The NumPy library takes advantage of a processor feature called **Single Instruction Multiple Data (SIMD)** to process data faster. SIMD allows a processor to perform the same operation on multiple data points in a single processor cycle.

This concept of replacing for loops with operations applied to multiple data points at once is called **vectorization**.
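
As a minimal sketch of vectorization, the loop over `lolst` shown earlier can be rewritten as a single expression on an ndarray. The names `rows` and `vectorized_sums` are illustrative and not part of the original notes; whether SIMD instructions are actually used depends on the NumPy build and the processor.

```python
import numpy as np

# The same four rows used in the loop example.
lolst = [[1, 2], [0, 1], [2, 1], [1, 1]]

# Convert the list of lists to a 2D ndarray (4 rows, 2 columns).
rows = np.array(lolst)

# Add the first column to the second column in one expression,
# with no explicit Python loop.
vectorized_sums = rows[:, 0] + rows[:, 1]

print(vectorized_sums.tolist())  # [3, 1, 3, 2]
```

The result matches the `sums` list produced by the loop version.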

The core data structure in NumPy that makes vectorization possible is the **ndarray**, or **n-dimensional array**. In programming, an array is a collection of elements, similar to a list. The term n-dimensional refers to the fact that ndarrays can have one or more dimensions.

In [3]:
import numpy as np

data_ndarray = np.array([10, 20, 30])
data_ndarray.dtype


dtype('int32')

We'll analyze taxi trip data released by the city of New York.

The data set contains approximately 90,000 yellow taxi trips to and from New York City airports between January and June 2016. Below is information about selected columns from the data set:

* `pickup_year`: The year of the trip.
* `pickup_month`: The month of the trip (January is 1, December is 12).
* `pickup_day`: The day of the month of the trip.
* `pickup_location_code`: The airport or borough where the trip started.
* `dropoff_location_code`: The airport or borough where the trip finished.
* `trip_distance`: The distance of the trip in miles.
* `trip_length`: The length of the trip in seconds.
* `fare_amount`: The base fare of the trip, in dollars.
* `total_amount`: The total amount charged to the passenger, including all fees, tolls and tips.

In [3]:
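from csv import reader

# Read the CSV file; the with block closes the file when we're done
with open("nyc_taxis.csv") as f:
    data = list(reader(f))

data = data[1:]  # drop the header row

# Convert every value from a string to a float
converted_list = []
for row in data:
    converted_list.append([float(item) for item in row])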

        

In [12]:
import numpy as np

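# Method 1: let NumPy convert the strings while building the array
# taxi = np.array(data, dtype=np.float64)
# taxi = taxi.astype("float64")  # astype returns a new array, so assign the result

# Method 2: build the array from the already-converted list of lists
taxi = np.array(converted_list)
print(taxi.dtype)

# Method 3: read the CSV directly, skipping the header row
# taxi = np.genfromtxt("nyc_taxis.csv", skip_header=1, delimiter=",")
# print(taxi.dtype)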
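
To illustrate Method 3 without needing `nyc_taxis.csv` on disk, here is a small sketch that reads a made-up, two-column CSV from memory. The CSV text and the name `sample` are hypothetical; only `np.genfromtxt` and its `delimiter` and `skip_header` parameters come from the cell above.

```python
import io
import numpy as np

# Hypothetical CSV with a header row and two data rows, held in memory.
csv_text = "trip_distance,fare_amount\n1.5,7.0\n21.0,52.0\n"

# skip_header=1 skips the first line, so only the numeric rows are parsed.
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(sample.shape)  # (2, 2)
print(sample.dtype)  # float64
```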

The ellipses (...) between rows and columns indicate that there is more data in our NumPy ndarray than can easily be printed.

When we can't easily print the entire ndarray, we can use the `ndarray.shape` attribute instead to see how many rows and columns it has.

The data type returned is called a **tuple**. Tuples are very similar to Python lists, but can't be modified.

The output gives us a few important pieces of information:

* The first number tells us how many rows the ndarray has.
* The second number tells us how many columns the ndarray has.
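
A minimal sketch of reading a shape, using a made-up 2D ndarray named `small` (not part of the taxi data). It also shows that the returned tuple can't be modified.

```python
import numpy as np

# Hypothetical 2D ndarray with 2 rows and 3 columns.
small = np.array([[5, 10, 15],
                  [20, 25, 30]])

small_shape = small.shape
print(small_shape)        # (2, 3)
print(type(small_shape))  # <class 'tuple'>

# A tuple can be unpacked just like a list.
n_rows, n_cols = small_shape

# Unlike a list, a tuple rejects item assignment with a TypeError.
try:
    small_shape[0] = 99
    modified = True
except TypeError:
    modified = False

print(modified)  # False
```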

In [13]:
taxi_shape = taxi.shape
taxi_shape

(89560, 15)

In [14]:
row_0 = taxi[0] # Select the row at index 0
rows_391_to_500 = taxi[391:501] # Select every column for the rows at indexes 391 to 500 inclusive
row_21_column_5 = taxi[21,5] # Select the item at row index 21 and column index 5

In [15]:
columns_1_4_7 = taxi[:,[1,4,7]] # Select every row for the columns at indexes 1, 4, and 7
row_99_columns_5_to_8 = taxi[99,5:9] # Select the columns at indexes 5 to 8 inclusive for the row at index 99
rows_100_to_200_column_14 = taxi[100:201,14] # Select the rows at indexes 100 to 200 inclusive for the column at index 14
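
The same selection patterns are easier to check by eye on a tiny array. This sketch assumes a made-up 4x5 ndarray named `grid`; as with list slicing, the stop index of a slice is excluded.

```python
import numpy as np

# Hypothetical 4x5 ndarray holding the values 0 to 19, row by row.
grid = np.arange(20).reshape(4, 5)

first_row = grid[0]               # row at index 0
rows_1_to_2 = grid[1:3]           # rows at indexes 1 and 2 (3 is excluded)
single_item = grid[2, 3]          # row index 2, column index 3
cols_0_and_4 = grid[:, [0, 4]]    # every row, columns at indexes 0 and 4
row_3_cols_1_to_2 = grid[3, 1:3]  # row index 3, columns at indexes 1 and 2

print(first_row.tolist())          # [0, 1, 2, 3, 4]
print(single_item)                 # 13
print(row_3_cols_1_to_2.tolist())  # [16, 17]
```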

In [16]:
fare_and_fees = taxi[:,9] + taxi[:,10] # add two columns

The result of adding two 1D ndarrays is a 1D ndarray of the same shape (or dimensions) as the originals. In this context, ndarrays can also be called **vectors**, a term taken from a branch of mathematics called **linear algebra**. What we just did, adding two vectors together, is called **vector addition**.

We can actually use any of the standard Python numeric operators with vectors, including:

* `vector_a + vector_b`: Addition
* `vector_a - vector_b`: Subtraction
* `vector_a * vector_b`: Multiplication (this is unrelated to the vector multiplication used in linear algebra).
* `vector_a / vector_b`: Division

When we perform these operations on two 1D vectors, both vectors must have the same shape.
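
The four operators can be sketched on two small vectors with made-up values. The last part shows what happens when the lengths differ (here 3 and 4); `vector_c` and `shapes_compatible` are illustrative names.

```python
import numpy as np

vector_a = np.array([10.0, 20.0, 30.0])
vector_b = np.array([2.0, 4.0, 5.0])

# Each operator works element by element.
print((vector_a + vector_b).tolist())  # [12.0, 24.0, 35.0]
print((vector_a - vector_b).tolist())  # [8.0, 16.0, 25.0]
print((vector_a * vector_b).tolist())  # [20.0, 80.0, 150.0]
print((vector_a / vector_b).tolist())  # [5.0, 5.0, 6.0]

# Vectors of lengths 3 and 4 cannot be combined; NumPy raises ValueError.
vector_c = np.array([1.0, 2.0, 3.0, 4.0])
try:
    vector_a + vector_c
    shapes_compatible = True
except ValueError:
    shapes_compatible = False

print(shapes_compatible)  # False
```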

In [17]:
# Calculate the miles per hour

trip_distance_miles = taxi[:,7]
trip_length_seconds = taxi[:,8]
trip_mph = trip_distance_miles / (trip_length_seconds / 3600)  # 3600 seconds in one hour

NumPy ndarrays have methods for many different calculations. A few key methods are:

* `ndarray.min()` to calculate the minimum value
* `ndarray.max()` to calculate the maximum value
* `ndarray.mean()` to calculate the mean or average value
* `ndarray.sum()` to calculate the sum of the values
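
A quick sketch of these four methods on a small vector. The values in `speeds` are made up for illustration.

```python
import numpy as np

# Hypothetical speeds in miles per hour.
speeds = np.array([12.0, 30.0, 18.0, 20.0])

print(speeds.min())   # 12.0
print(speeds.max())   # 30.0
print(speeds.mean())  # 20.0
print(speeds.sum())   # 80.0
```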

In [21]:
mph_max = trip_mph.max() 
mph_mean = trip_mph.mean()
mph_min = trip_mph.min() 

print(round(mph_max, ndigits=2))
print(round(mph_mean, ndigits=2))
print(round(mph_min, ndigits=2))

82800.0
32.24
0.0


A trip speed of 82,800 mph is definitely not possible in New York traffic; that's almost 20x faster than the fastest plane in the world! This could be due to an error in the devices that record the data, or to errors made somewhere in the data pipeline.

# Review the difference between methods and functions. 

Functions act as standalone segments of code that usually take an input, perform some processing, and return some output. For example, we can use the `len()` function to calculate the length of a list or the number of characters in a string.

In contrast, methods are special functions that belong to a specific type of object. This means that, for instance, when we work with list objects, there are special functions or methods that can only be used with lists. For example, we can use the `list.append()` method to add an item to the end of a list. If we try to use that method on a string, we will get an error.
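
The difference can be sketched with a list and a string. The variable names are illustrative; the error is caught here so that the example runs to completion.

```python
numbers = [1, 2, 3]

# len() is a function: it works on lists and strings alike.
print(len(numbers))  # 3
print(len("abc"))    # 3

# append() is a list method: it belongs to list objects.
numbers.append(4)
print(numbers)       # [1, 2, 3, 4]

# Strings have no append() method, so this raises AttributeError.
text = "abc"
try:
    text.append("d")
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

print(error_name)    # AttributeError
```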

In NumPy, sometimes there are operations that are implemented as both methods and functions, which can be confusing.

To remember the right terminology: anything that starts with `np` (e.g. `np.mean()`) is a function, and anything expressed with an object (or variable) name first (e.g. `trip_mph.mean()`) is a method. When both exist, it's up to us to decide which to use, but it's much more common to use the method approach.
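
As a small illustration, the function form and the method form of the mean give the same result. The vector `durations` and its values are hypothetical.

```python
import numpy as np

# Hypothetical trip lengths in seconds.
durations = np.array([600.0, 900.0, 1500.0])

function_result = np.mean(durations)  # function: starts with np
method_result = durations.mean()      # method: starts with the object name

print(function_result)  # 1000.0
print(method_result)    # 1000.0
```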

Next, we'll calculate statistics for 2D ndarrays. If we use the `ndarray.max()` method on a 2D ndarray without any additional parameters, it returns a single value, just like with a 1D array.

But what if we wanted to find the maximum value of each row? We'd use the `axis` parameter and specify a value of 1 to calculate the maximum value for each row.

If we want to find the maximum value of each column, we'd use an `axis` value of 0.
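
The three cases can be sketched on a made-up 2x3 ndarray named `table`.

```python
import numpy as np

# Hypothetical 2D ndarray with 2 rows and 3 columns.
table = np.array([[1, 8, 3],
                  [7, 2, 9]])

overall_max = table.max()       # single value across the whole array
row_max = table.max(axis=1)     # one value per row
column_max = table.max(axis=0)  # one value per column

print(overall_max)          # 9
print(row_max.tolist())     # [8, 9]
print(column_max.tolist())  # [7, 8, 9]
```

Note that `axis=1` produces one result per row and `axis=0` produces one result per column, which is the same pattern `fare_components.sum(axis = 1)` relies on in the next cell.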

In [22]:
taxi_first_five = taxi[:5] # first five rows 
fare_components = taxi[:5,9:13] # select these columns: fare_amount, fees_amount, tolls_amount, tip_amount

fare_sums = fare_components.sum(axis = 1)
fare_totals = taxi_first_five[:,13]

# compare the summed columns to the fare_totals
print(fare_sums)
print(fare_totals)


[69.99 54.3  37.8  32.76 18.8 ]
[69.99 54.3  37.8  32.76 18.8 ]
