# Reading CSV files with NumPy

<div><p>In the previous mission, we learned how to use NumPy and vectorized operations to analyze <a href="https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page" target="_blank">taxi trip data</a> from the city of New York. We learned that NumPy makes it quick and easy to select data, and includes a number of functions and methods that make it easy to calculate statistics across the different axes (or dimensions). </p>
<p>However, what if we also wanted to find out how many trips were taken in each month?  Or which airport is the busiest? For this, we will learn a new technique: <strong>Boolean Indexing</strong>. Before we dive into what boolean indexing is and how it can help us, let's refamiliarize ourselves with our data. </p>
<p>Here are the first 5 rows of our data with column labels:</p>
<table class="dataframe">
<thead>
<tr>
<th>pickup_year</th>
<th>pickup_month</th>
<th>pickup_day</th>
<th>pickup_dayofweek</th>
<th>pickup_time</th>
<th>pickup_location_code</th>
<th>dropoff_location_code</th>
<th>trip_distance</th>
<th>trip_length</th>
<th>fare_amount</th>
<th>fees_amount</th>
<th>tolls_amount</th>
<th>tip_amount</th>
<th>total_amount</th>
<th>payment_type</th>
</tr>
</thead>
<tbody>
<tr>
<td>2016</td>
<td>1</td>
<td>1</td>
<td>5</td>
<td>0</td>
<td>2</td>
<td>4</td>
<td>21.00</td>
<td>2037</td>
<td>52.0</td>
<td>0.8</td>
<td>5.54</td>
<td>11.65</td>
<td>69.99</td>
<td>1</td>
</tr>
<tr>
<td>2016</td>
<td>1</td>
<td>1</td>
<td>5</td>
<td>0</td>
<td>2</td>
<td>1</td>
<td>16.29</td>
<td>1520</td>
<td>45.0</td>
<td>1.3</td>
<td>0.00</td>
<td>8.00</td>
<td>54.30</td>
<td>1</td>
</tr>
<tr>
<td>2016</td>
<td>1</td>
<td>1</td>
<td>5</td>
<td>0</td>
<td>2</td>
<td>6</td>
<td>12.70</td>
<td>1462</td>
<td>36.5</td>
<td>1.3</td>
<td>0.00</td>
<td>0.00</td>
<td>37.80</td>
<td>2</td>
</tr>
<tr>
<td>2016</td>
<td>1</td>
<td>1</td>
<td>5</td>
<td>0</td>
<td>2</td>
<td>6</td>
<td>8.70</td>
<td>1210</td>
<td>26.0</td>
<td>1.3</td>
<td>0.00</td>
<td>5.46</td>
<td>32.76</td>
<td>1</td>
</tr>
<tr>
<td>2016</td>
<td>1</td>
<td>1</td>
<td>5</td>
<td>0</td>
<td>2</td>
<td>6</td>
<td>5.56</td>
<td>759</td>
<td>17.5</td>
<td>1.3</td>
<td>0.00</td>
<td>0.00</td>
<td>18.80</td>
<td>2</td>
</tr>
</tbody>
</table>
<p>Below is information about selected columns from the data set:</p>
<ul>
<li><code>pickup_year</code>: The year of the trip.</li>
<li><code>pickup_month</code>: The month of the trip (January is <code>1</code>, December is <code>12</code>).</li>
<li><code>pickup_day</code>: The day of the month of the trip.</li>
<li><code>pickup_location_code</code>: The airport or <a href="https://en.wikipedia.org/wiki/Boroughs_of_New_York_City" target="_blank">borough</a> where the trip started.</li>
<li><code>dropoff_location_code</code>: The airport or borough where the trip finished.</li>
<li><code>trip_distance</code>: The distance of the trip in miles.</li>
<li><code>trip_length</code>: The length of the trip in seconds.</li>
<li><code>fare_amount</code>: The base fare of the trip, in dollars.</li>
<li><code>total_amount</code>: The total amount charged to the passenger, including all fees, tolls and tips.</li>
</ul>
<p>You can find information on all columns in the <a href="https://s3.amazonaws.com/dq-content/290/nyc_taxi_data_dictionary.md" target="_blank">data dictionary</a>.</p>
<p>Now that we understand NumPy a little better, let's learn how to use the <a href="http://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt" target="_blank"><code>numpy.genfromtxt()</code> function</a> to read files into NumPy ndarrays. Here is the simplified syntax for the function, and an explanation of the two parameters:</p>
</div>

```
np.genfromtxt(filename, delimiter=None)
```

<div>
<ul>
<li><code>filename</code>: A positional argument, usually a string representing the path to the text file to be read.</li>
<li><code>delimiter</code>: A named argument, specifying the string used to separate each value.</li>
</ul>
<p>In this case, because we have a CSV file, the delimiter is a comma. Here's how we'd read in a file named <code>data.csv</code>:</p>
</div>

```
data = np.genfromtxt('data.csv', delimiter=',')
```

<div>
<p>Let's read our <code>nyc_taxis.csv</code> file into NumPy next.</p></div>

In [1]:
import numpy as np
taxi = np.genfromtxt("nyc_taxis.csv", delimiter=',')
taxi_shape = taxi.shape

<div><p>In the last exercise, we used the <code>numpy.genfromtxt()</code> function to read the <code>nyc_taxis.csv</code> file into NumPy, which allowed us to import the data much more quickly and efficiently than the method we used in the previous mission:</p>
</div>

```
import csv
import numpy as np

# import nyc_taxis.csv as a list of lists
with open("nyc_taxis.csv", "r") as f:
    taxi_list = list(csv.reader(f))

# remove the header row
taxi_list = taxi_list[1:]

# convert all values to floats
converted_taxi_list = []
for row in taxi_list:
    converted_row = []
    for item in row:
        converted_row.append(float(item))
    converted_taxi_list.append(converted_row)

# convert the list of lists to an ndarray
taxi = np.array(converted_taxi_list)
```

<div>
<p>You may not have noticed in the last mission that we converted all the values to floats before we converted the list of lists to an ndarray. That's because NumPy ndarrays can contain only <em>one datatype</em>.</p>
<p>We didn't have to complete this step in the last exercise, because when <code>numpy.genfromtxt()</code> reads in a file, it converts each value to a floating-point number by default. </p>
<p>We can use the <a href="http://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.ndarray.dtype.html#numpy.ndarray.dtype" target="_blank"><code>ndarray.dtype</code> attribute</a> to see the internal datatype that has been used.</p>
</div>

```
print(taxi.dtype)
```
```
float64
```

<div>
<p>NumPy used the <code>float64</code> type, the default for <code>numpy.genfromtxt()</code>, which allows most of the values from our CSV to be read. You can think of NumPy's <code>float64</code> type as being identical to Python's <code>float</code> type (the "64" refers to the number of <a href="https://en.wikipedia.org/wiki/Bit" target="_blank">bits</a> used to store the underlying value).</p>
<p>If we review the results from the last exercise, we can see that <code>taxi</code> contains mostly numbers, along with a value that we haven't seen before: <code>nan</code>.</p>
</div>

```
[[   nan    nan    nan ...,    nan    nan    nan]
 [  2016      1      1 ...,  11.65  69.99      1]
 [  2016      1      1 ...,      8   54.3      1]
 ..., 
 [  2016      6     30 ...,      5  63.34      1]
 [  2016      6     30 ...,   8.95  44.75      1]
 [  2016      6     30 ...,      0  54.84      2]]
```

<div>
<p>NaN is an acronym for <strong>Not a Number</strong> - it literally means that the value cannot be stored as a number.  It is similar to (and often referred to as a) null value, like Python's <a href="https://docs.python.org/3.4/library/constants.html#None" target="_blank"><code>None</code> constant</a>.</p>
<p>NaN is most commonly seen when a value is missing, but in this case, we have NaN values because the first line from our CSV file contains the names of each column. NumPy is unable to convert string values like <code>pickup_year</code> into the <code>float64</code> data type.</p>
<p>For now, we need to remove this header row from our ndarray. We can do this the same way we would if our data was stored in a list of lists:</p>
</div>

```
taxi = taxi[1:]
```

<div>
<p>Alternatively, we can pass an additional parameter, <code>skip_header</code>, to the <code>numpy.genfromtxt()</code> function.  The <code>skip_header</code> parameter accepts an integer, the number of rows from the start of the file to skip. Note that because this integer should be the <em>number of rows</em> and not the index, skipping the first row would require a value of <code>1</code>, not <code>0</code>.</p></div>
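To see the effect of the header row and of `skip_header` without the real file, here is a minimal sketch. The three-column CSV text is made up for illustration (it is not the contents of `nyc_taxis.csv`), and `io.StringIO` stands in for a file on disk.

```python
import io
import numpy as np

# Hypothetical three-column sample standing in for nyc_taxis.csv
csv_text = "pickup_year,pickup_month,fare_amount\n2016,1,52.0\n2016,2,45.0\n"

# Without skip_header, the header row is read as a row of nan values
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

# skip_header=1 skips one line from the start of the file
without_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(with_header.shape)               # (3, 3)
print(without_header.shape)            # (2, 3)
print(np.isnan(with_header[0]).all())  # True
```

Both approaches (slicing with `[1:]` or passing `skip_header=1`) leave the same data rows; `skip_header` simply avoids creating the row of `nan` values in the first place.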

### Instructions

<ol>
<li>Use the <code>numpy.genfromtxt()</code> function to again read the <code>nyc_taxis.csv</code> file into NumPy, but this time, skip the first row. Assign the result to <code>taxi</code>.</li>
<li>Assign the shape of <code>taxi</code> to <code>taxi_shape</code>.</li>
<li>Use the variable inspector under the code box to view the <code>taxi</code> ndarray and its shape after you have run your code.</li>
</ol>

In [2]:
import numpy as np
taxi = np.genfromtxt("nyc_taxis.csv", delimiter=',', skip_header=1)
taxi_shape = taxi.shape

# Boolean arrays

<div><p>In the last mission, we learned how to index — or select — data from ndarrays. In this mission, we're going to focus on arguably the most powerful method, the boolean array.  A <strong>boolean array</strong>, as the name suggests, is an array of boolean values. Boolean arrays are sometimes called <strong>boolean vectors</strong> or <strong>boolean masks</strong>.</p>
<p>You may recall that the boolean (or <code>bool</code>) type is a built-in Python type that can be one of two unique values:</p>
<ul>
<li><code>True</code></li>
<li><code>False</code></li>
</ul>
<p>You may also remember that we've used boolean values when working with Python <a href="https://docs.python.org/3.4/library/stdtypes.html#comparisons" target="_blank">comparison operators</a> like <code>==</code> (equal), <code>&gt;</code> (greater than), <code>&lt;</code> (less than), and <code>!=</code> (not equal). Below are a couple of examples of simple boolean operations:</p>
</div>

```
print(type(3.5) == float)
```
```
True
```
```
print(5 > 6)
```
```
False
```

<div>
<p>When we explored vector math in the first mission, we learned that an operation between an ndarray and a single value results in a new ndarray:</p>
</div>

```
print(np.array([2,4,6,8]) + 10)
```
```
[12 14 16 18]
```

<div>
<p>The <code>+ 10</code> operation is applied to each value in the array.</p>
<p>Now, let's look at what happens when we perform a <em>boolean operation</em> between an ndarray and a single value:</p>
</div>

```
print(np.array([2,4,6,8]) < 5)
```
```
[ True  True False False]
```

<div>
<p>A similar pattern occurs – each value in the array is compared to five. If the value is less than five, <code>True</code> is returned. Otherwise, <code>False</code> is returned.</p>
<p><img src="https://s3.amazonaws.com/dq-content/290/vectorized_bool.svg" alt="Vectorized boolean operation"></p>
<p>Let's practice using vectorized boolean operations to create some boolean arrays.</p></div>
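The other comparison operators vectorize in the same way. The short sketch below reuses the example array from above with `!=` and `>=`, and checks the datatype of the result; the variable names are illustrative only.

```python
import numpy as np

values = np.array([2, 4, 6, 8])

# Each comparison is applied element by element
not_four = values != 4
at_least_six = values >= 6

print(not_four)        # [ True False  True  True]
print(at_least_six)    # [False False  True  True]
print(not_four.dtype)  # bool
```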


### Instructions

<ol>
<li>Use vectorized boolean operations to:<ul>
<li>Evaluate whether the elements in array <code>a</code> are less than <code>3</code>. Assign the result to <code>a_bool</code>.</li>
<li>Evaluate whether the elements in array <code>b</code> are equal to <code>"blue"</code>. Assign the result to <code>b_bool</code>.</li>
<li>Evaluate whether the elements in array <code>c</code> are greater than <code>100</code>. Assign the result to <code>c_bool</code>.</li>
</ul>
</li>
<li>Once you've run your code, use the variable inspector below the code box to view each boolean array.</li>
</ol>


In [3]:
a = np.array([1, 2, 3, 4, 5])
b = np.array(["blue", "blue", "red", "blue"])
c = np.array([80.0, 103.4, 96.9, 200.3])

a_bool = a < 3
b_bool = b == "blue"
c_bool = c > 100

# Boolean indexing with 1D ndarrays

<div><p>In the last screen, we learned how to create boolean arrays using vectorized boolean operations. Next, we'll learn how to <em>index</em> (or select) using boolean arrays, known as <strong>boolean indexing</strong>. Let's use one of the examples from the previous screen.</p>
<p><img src="https://s3.amazonaws.com/dq-content/290/1d_bool_1.svg" alt="Boolean indexing 1D ndarrays 1"></p>
<p>To index using our new boolean array, we simply insert it in the square brackets, just like we would do with our other selection techniques:</p>
<p><img src="https://s3.amazonaws.com/dq-content/290/1d_bool_2.svg" alt="Boolean indexing 1D ndarrays 2"></p>
<p>The boolean array acts as a filter, so that the values corresponding to <code>True</code> become part of the result and the values corresponding to <code>False</code> are removed.</p>
<p>Let's use boolean indexing to confirm the number of taxi rides in our data set from the month of January. First, let's select just the <code>pickup_month</code> column, which is the second column in the ndarray:</p>
</div>

```
pickup_month = taxi[:,1]
```

<div>
<p>Next, we use a boolean operation to make a boolean array, where the value <code>1</code> corresponds to January:</p>
</div>

```
january_bool = pickup_month == 1
```

<div>
<p>Then we use the new boolean array to select only the items from <code>pickup_month</code> that have a value of <code>1</code>:</p>
</div>

```
january = pickup_month[january_bool]
```

<div>
<p>Finally, we use the <code>.shape</code> attribute to find out how many items are in our <code>january</code> ndarray, which is equal to the number of taxi rides from the month of January. We'll use <code>[0]</code> to extract the value from the tuple returned by <code>.shape</code>:</p>
</div>

```
january_rides = january.shape[0]
print(january_rides)
```
```
800
```

<div>
<p>There are 800 rides in our dataset from the month of January.  Let's practice boolean indexing and find out the number of rides in our data set for February.</p></div>
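The same filter-then-count steps can be tried on a small array. This is only a sketch: the month values below are made up, not taken from the taxi data. It also shows that summing the boolean array gives the same count, since `True` counts as 1.

```python
import numpy as np

# Made-up month values, not taken from the taxi data
months = np.array([1, 2, 1, 3, 1, 2])

january_bool = months == 1
january = months[january_bool]

print(january)             # [1 1 1]
print(january.shape[0])    # 3

# True counts as 1 when summed, so this gives the same count
print(january_bool.sum())  # 3
```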

### Instructions 

<ol>
<li>Calculate the number of rides in the <code>taxi</code> ndarray that are from February:<ul>
<li>Create a boolean array, <code>february_bool</code>, that evaluates whether the items in <code>pickup_month</code> are equal to <code>2</code>.</li>
<li>Use the <code>february_bool</code> boolean array to index <code>pickup_month</code>. Assign the result to <code>february</code>.</li>
<li>Use the <code>ndarray.shape</code> attribute to find the number of items in <code>february</code>. Assign the result to <code>february_rides</code>.</li>
</ul>
</li>
<li>Once you have run your code, use the variable inspector to view the number of rides for February.</li>
</ol>

In [4]:
pickup_month = taxi[:,1]

january_bool = pickup_month == 1
january = pickup_month[january_bool]
january_rides = january.shape[0]

february_bool = pickup_month == 2
february = pickup_month[february_bool]
february_rides = february.shape[0]

# Boolean indexing with 2D ndarrays

<div><p>When working with 2D ndarrays, you can use boolean indexing in combination with any of the indexing methods we learned in the previous mission.  The only limitation is that the boolean array must have the same length as the dimension you're indexing.  Let's look at some examples:</p>
<p><img src="https://s3.amazonaws.com/dq-content/290/bool_dims_updated.svg" alt="Boolean indexing 1D ndarrays 2"></p>
<p>Because a boolean array contains no information about how it was created, we can use a boolean array made from just one column of our array to index the whole array.</p>
<p>Let's use what we've learned to analyze the average speed of trips. In the previous mission, we calculated the maximum trip speed to be 82,000 mph, which we know is definitely not accurate. Let's verify if there are any issues with the data. Recall that we calculated the average travel speed as follows:</p>
</div>

```
# calculate the average speed
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)
```

<div>
<p>Next, we'll check for trips with an average speed greater than 20,000 mph:</p>
</div>

```
# create a boolean array for trips with average
# speeds greater than 20,000 mph
trip_mph_bool = trip_mph > 20000

# use the boolean array to select the rows for
# those trips, and the pickup_location_code,
# dropoff_location_code, trip_distance, and
# trip_length columns
trips_over_20000_mph = taxi[trip_mph_bool,5:9]

print(trips_over_20000_mph)
```
```
[[     2      2     23      1]
 [     2      2   19.6      1]
 [     2      2   16.7      2]
 [     3      3   17.8      2]
 [     2      2   17.2      2]
 [     3      3   16.9      3]
 [     2      2   27.1      4]]
```

<div>
<p>We can see from the last column that these are all very short rides: every one has a <code>trip_length</code> value of 4 seconds or less, which does not reconcile with the trip distances, all of which are more than 16 miles.</p>
<p>Let's use this technique to examine the rows that have the highest values for the <code>tip_amount</code> column.</p></div>
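The length rule described above can be checked on a small array. In this sketch the 3x4 array is made up for illustration: a boolean array used on the row axis needs three values, and one used on the column axis needs four.

```python
import numpy as np

# Made-up 3x4 array, not taken from the taxi data
arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12]])

# Row mask built from one column: its length (3) matches the number of rows
row_mask = arr[:, 0] > 2
# Keep the matching rows and the columns at indexes 1 and 2
rows = arr[row_mask, 1:3]

# Column mask: its length (4) matches the number of columns
col_mask = np.array([True, False, False, True])
cols = arr[:, col_mask]

print(rows.shape)  # (2, 2)
print(cols.shape)  # (3, 2)
```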

### Instructions

<ol>
<li>Create a boolean array, <code>tip_bool</code>, that determines which rows have values for the <code>tip_amount</code> column of more than <code>50</code>.</li>
<li>Use the <code>tip_bool</code> array to select all rows from <code>taxi</code> with tip amounts of more than <code>50</code>, and the columns from indexes 5 to 13 inclusive. Assign the resulting array to <code>top_tips</code>.</li>
</ol>

In [5]:
tip_amount = taxi[:,12]

tip_bool = tip_amount > 50
top_tips = taxi[tip_bool, 5:14]


# Assigning values to ndarrays

<div><p>So far, we've learned how to retrieve data from ndarrays. Next, we'll use the same indexing techniques we've already learned to modify values within an ndarray.  The syntax we'll use (in pseudocode) is:</p>
</div>

```
ndarray[location_of_values] = new_value
```

<div>
<p>Let's take a look at what that looks like in actual code.  With our 1D array, we can specify one specific index location:</p>
</div>

```
a = np.array(['red','blue','black','blue','purple'])
a[0] = 'orange'
print(a)
```
```
['orange' 'blue' 'black' 'blue' 'purple']
```

<div>
<p>Or we can assign multiple values at once:</p>
</div>

```
a[3:] = 'pink'
print(a)
```
```
['orange' 'blue' 'black' 'pink' 'pink']
```
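One caveat with string arrays is worth a quick sketch. NumPy fixes the maximum string length when the array is created, so assigning a longer string silently truncates it. The array and the color `'turquoise'` below are made up for illustration.

```python
import numpy as np

# The longest string here has 5 characters, so the dtype is <U5
colors = np.array(['red', 'blue', 'black'])
print(colors.dtype)  # <U5

# 'turquoise' has 9 characters; only the first 5 are stored
colors[0] = 'turquoise'
print(colors[0])     # turqu
```

The `'orange'` assignment above works as expected because `'purple'` already gives that array room for six characters.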

<div>
<p>With a 2D ndarray, just like with a 1D ndarray, we can assign one specific index location:</p>
</div>

```
ones = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 1]])
ones[1,2] = 99
print(ones)
```
```
[[ 1  1  1  1  1]
 [ 1  1 99  1  1]
 [ 1  1  1  1  1]]
```

<div>
<p>We can also assign a whole row...</p>
</div>

```
ones[0] = 42
print(ones)
```
```
[[42 42 42 42 42]
 [ 1  1 99  1  1]
 [ 1  1  1  1  1]]
```

<div>
<p>...or a whole column:</p>
</div>

```
ones[:,2] = 0
print(ones)
```
```
[[42 42  0 42 42]
 [ 1  1  0  1  1]
 [ 1  1  0  1  1]]
```

<div>
<p>Let's practice some array assignment with our taxi dataset.</p></div>

### Instructions 

<p>To help you practice without making changes to our original array, we have used the <a href="http://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.ndarray.copy.html#numpy.ndarray.copy" target="_blank"><code>ndarray.copy()</code> method</a> to make <code>taxi_modified</code>, a copy of our original for these exercises.</p>
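As a quick illustration of why the copy matters, the sketch below uses a small made-up array. Assigning to the copy leaves the original untouched.

```python
import numpy as np

# Made-up array standing in for the taxi data
original = np.array([[1, 2],
                     [3, 4]])

# copy() allocates new memory, so later assignments do not affect the original
duplicate = original.copy()
duplicate[0, 0] = 99

print(original[0, 0])   # 1
print(duplicate[0, 0])  # 99
```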

<ol>
<li>The value at column index <code>5</code> (pickup_location) of row index <code>1066</code> is incorrect.  Use assignment to change this value to <code>1</code> in the <code>taxi_modified</code> ndarray.</li>
<li>The first column (index <code>0</code>) contains year values as four digit numbers in the format YYYY (<code>2016</code>, since all trips in our data set are from 2016).  Use assignment to change these values to the YY format (<code>16</code>) in the <code>taxi_modified</code> ndarray.</li>
<li>The values at column index <code>7</code> (trip_distance) of row indexes <code>550</code> and <code>551</code> are incorrect. Use assignment to change these values in the <code>taxi_modified</code> ndarray to the mean value for that column.</li>
</ol>

In [6]:
# this creates a copy of our taxi ndarray
taxi_modified = taxi.copy()

taxi_modified[1066, 5] = 1
taxi_modified[:, 0] = 16
taxi_modified[[550, 551], 7] = taxi_modified[:, 7].mean()

# Assignment using Boolean arrays

<div><p>Boolean arrays become very powerful when we use them for assignment. Let's look at an example:</p>
</div>

```
a2 = np.array([1, 2, 3, 4, 5])

a2_bool = a2 > 2

a2[a2_bool] = 99

print(a2)
```
```
[ 1  2 99 99 99]
```

<div>
<p>The boolean array controls the values that the assignment applies to, and the other values remain unchanged. Let's look at how this code works:</p>
<p><img src="https://s3.amazonaws.com/dq-content/290/bool_assignment_1.svg" alt="Boolean assignment example 1"></p>
<p>Notice in the diagram above that we used a "shortcut" - we inserted the definition of the boolean array directly into the selection. This "shortcut" is the conventional way to write boolean indexing. Up until now, we've been assigning to an intermediate variable first so that the process is clear, but from here on, we will use this "shortcut" method instead. </p></div>
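Here is a minimal sketch of the shortcut applied to one column of a 2D array. The data is made up for illustration. It relies on the fact that selecting a column with a slice returns a view of the original array, so assigning through that view also changes the array it came from.

```python
import numpy as np

# Made-up 3x2 array; the second column plays the role of an amount column
data = np.array([[1.0, -5.0],
                 [2.0, 7.5],
                 [3.0, -0.5]])

# Slicing a column returns a view of data, not a copy
amounts = data[:, 1]

# Shortcut form: the boolean array is defined inside the brackets
amounts[amounts < 0] = 0

# Only the negative entries of the second column of data were changed
print(data[:, 1])  # [0.  7.5 0. ]
```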

### Instructions

<p>We again used the <a href="http://docs.scipy.org/doc/numpy-1.14.2/reference/generated/numpy.ndarray.copy.html#numpy.ndarray.copy" target="_blank"><code>ndarray.copy()</code> method</a> to make <code>taxi_copy</code>, a copy of our original for this exercise.</p>

<ol>
<li>Select the fourteenth column (index 13) in <code>taxi_copy</code>. Assign it to a variable named <code>total_amount</code>.</li>
<li>For rows where the value of <code>total_amount</code> is less than <code>0</code>, use assignment to change the value to <code>0</code>.</li>
</ol>

In [7]:
# this creates a copy of our taxi ndarray
taxi_copy = taxi.copy()

total_amount = taxi_copy[:, 13]
# total_amount is a view of taxi_copy, so this updates column 13 in place
total_amount[total_amount < 0] = 0

<div><p>Next, we'll look at an example of assignment using a boolean array with <em>two</em> dimensions:</p>
<p><img src="https://s3.amazonaws.com/dq-content/290/bool_assignment_2.svg" alt="Boolean assignment example 2"></p>
<p>The <code>b &gt; 4</code> boolean operation produces a 2D boolean array which then controls the values that the assignment applies to.</p>
<p>We can also use a 1D boolean array to perform assignment on a 2D array:</p>
<p><img src="https://s3.amazonaws.com/dq-content/290/bool_assignment_3.svg" alt="Boolean assignment example 3"></p>
<p>The <code>c[:,1] &gt; 2</code> boolean operation compares just one column's values and produces a 1D boolean array. We then use that boolean array as the row index for assignment, and <code>1</code> as the column index to specify the second column. Our boolean array is only applied to the second column, while all other values remain unchanged.</p>
<p>The pseudocode syntax for this code is as follows, first using an intermediate variable:</p>
</div>

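```
# first, using an intermediate variable
bool_array = array[:, column_for_comparison] == value_for_comparison
array[bool_array, column_for_assignment] = new_value
```
```
# then, all in one line
array[array[:, column_for_comparison] == value_for_comparison, column_for_assignment] = new_value
```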
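Both kinds of assignment described above can be tried on small arrays. This is only a sketch: the values below are made up and are not necessarily the ones shown in the diagrams.

```python
import numpy as np

# 2D boolean array: every element greater than 4 is replaced
b = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
b[b > 4] = 99

# 1D boolean array as the row index, 1 as the column index:
# only the second column changes, and only in rows where it was greater than 2
c = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
c[c[:, 1] > 2, 1] = 99

print(b)
print(c)
```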

<div>
<p>Let's practice this pattern using our taxi data set:</p></div>

### Instructions

<p>We have created a new copy of our taxi dataset, <code>taxi_modified</code>, with an additional column containing the value <code>0</code> for every row.</p>

<ol>
<li>In our new column at index <code>15</code>, assign the value <code>1</code> if the <code>pickup_location_code</code> (column index <code>5</code>) corresponds to an airport location, leaving the value as <code>0</code> otherwise by performing these three operations:<ul>
<li>For rows where the value for the column index <code>5</code> is equal to <code>2</code> (JFK Airport), assign the value <code>1</code> to column index <code>15</code>.</li>
<li>For rows where the value for the column index <code>5</code> is equal to <code>3</code> (LaGuardia Airport), assign the value <code>1</code> to column index <code>15</code>.</li>
<li>For rows where the value for the column index <code>5</code> is equal to <code>5</code> (Newark Airport), assign the value <code>1</code> to column index <code>15</code>.</li>
</ul>
</li>
</ol>
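The three assignments listed above can also be collapsed into one. As an alternative sketch, and not the approach this exercise asks for, `np.isin()` (a NumPy function that this mission does not otherwise cover) tests each value against a list of codes in a single step. The data below is made up, with the location code in column `0` and the flag in column `1`.

```python
import numpy as np

# Made-up location codes, plus a flag column filled with zeros
codes = np.array([2, 4, 3, 5, 1, 6], dtype=float)
data = np.column_stack([codes, np.zeros(codes.shape[0])])

# True where the code is one of the three airport codes (2, 3, 5)
is_airport = np.isin(data[:, 0], [2, 3, 5])

# Flag the airport rows in the second column
data[is_airport, 1] = 1

print(data[:, 1])  # [1. 0. 1. 1. 0. 0.]
```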

In [8]:
# create a new column filled with `0`.
zeros = np.zeros([taxi.shape[0], 1])
taxi_modified = np.concatenate([taxi, zeros], axis=1)
print(taxi_modified)

taxi_modified[taxi_modified[:, 5] == 2, 15] = 1
taxi_modified[taxi_modified[:, 5] == 3, 15] = 1
taxi_modified[taxi_modified[:, 5] == 5, 15] = 1

[[2.016e+03 1.000e+00 1.000e+00 ... 6.999e+01 1.000e+00 0.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 5.430e+01 1.000e+00 0.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 3.780e+01 2.000e+00 0.000e+00]
 ...
 [2.016e+03 6.000e+00 3.000e+01 ... 6.334e+01 1.000e+00 0.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 4.475e+01 1.000e+00 0.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 5.484e+01 2.000e+00 0.000e+00]]


# Challenge: Which is the most popular airport? 

<div><p>We'll conclude this mission with two challenges. Challenges are designed to help you practice the techniques you've learned in this mission.</p>
<p>We supplied several hints to help you, but first try to complete the challenge without the hints, if you can.  Don't be discouraged if these challenge steps take a few attempts to get right – working with data is an iterative process!</p>
<p>In this challenge, we want to figure out which airport is the most popular destination in our data set. To do that, we'll use boolean indexing to create three filtered arrays and then look at how many rows are in each array. </p>
<p>To complete this task, we'll need to check if the <code>dropoff_location_code</code> column (column index <code>6</code>) is equal to one of the following values:</p>
<ul>
<li><code>2</code>: JFK Airport</li>
<li><code>3</code>: LaGuardia Airport</li>
<li><code>5</code>: Newark Airport</li>
</ul></div>

### Instructions 

<ol>
<li>Using the original <code>taxi</code> ndarray, calculate how many trips had JFK Airport as their destination:<ul>
<li>Use boolean indexing to select only the rows where the <code>dropoff_location_code</code> column (column index <code>6</code>) has a value that corresponds to JFK. Assign the result to <code>jfk</code>.</li>
<li>Calculate how many rows are in the new <code>jfk</code> array and assign the result to <code>jfk_count</code>.</li>
</ul>
</li>
<li>Calculate how many trips from <code>taxi</code> had LaGuardia Airport as their destination:<ul>
<li>Use boolean indexing to select only the rows where the <code>dropoff_location_code</code> column (column index <code>6</code>) has a value that corresponds to LaGuardia. Assign the result to <code>laguardia</code>.</li>
<li>Calculate how many rows are in the new <code>laguardia</code> array. Assign the result to <code>laguardia_count</code>.</li>
</ul>
</li>
<li>Calculate how many trips from <code>taxi</code> had Newark Airport as their destination:<ul>
<li>Select only the rows where the <code>dropoff_location_code</code> column has a value that corresponds to Newark, and assign the result to <code>newark</code>.</li>
<li>Calculate how many rows are in the new <code>newark</code> array and assign the result to <code>newark_count</code>.</li>
</ul>
</li>
<li>After you have run your code, inspect the values for <code>jfk_count</code>, <code>laguardia_count</code>, and <code>newark_count</code> and see which airport has the most dropoffs.</li>
</ol>


In [9]:
jfk = taxi[taxi[:, 6] == 2, :]
jfk_count = jfk.shape[0]

laguardia = taxi[taxi[:, 6] == 3, :]
laguardia_count = laguardia.shape[0]

newark = taxi[taxi[:, 6] == 5, :]
newark_count = newark.shape[0]

# Challenge: Calculating stats for trips on clean data

<div><p>Our calculations in the previous screen show that LaGuardia is the most common airport for dropoffs in our data set.</p>
<p>Our second and final challenge involves removing potentially bad data from our data set, and then calculating some descriptive statistics on the remaining "clean" data.</p>
<p>We'll start by using boolean indexing to remove any rows that have an average speed for the trip greater than 100 mph (160 kph), which should remove the questionable data we have worked with over the past two missions.  Then, we'll use array methods to calculate the mean for specific columns of the remaining data.  The columns we're interested in are:</p>
<ul>
<li><code>trip_distance</code>, at column index <code>7</code></li>
<li><code>trip_length</code>, at column index <code>8</code></li>
<li><code>total_amount</code>, at column index <code>13</code></li>
</ul></div>
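The filter-then-summarize approach described above can be sketched on a tiny array. The rows below are made up for illustration, with the distance in column `0` and the trip length in column `1`, unlike the real data set.

```python
import numpy as np

# Made-up rows: [trip_distance in miles, trip_length in seconds]
trips = np.array([[10.0, 1800.0],   # about 20 mph
                  [20.0, 2.0],      # about 36,000 mph, clearly bad data
                  [5.0, 600.0]])    # about 30 mph

# Average speed in mph: miles divided by hours
mph = trips[:, 0] / (trips[:, 1] / 3600)

# Keep only the rows with a plausible speed
cleaned = trips[mph < 100]

print(cleaned.shape)         # (2, 2)
print(cleaned[:, 0].mean())  # 7.5
```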

### Instructions 

<p>The <code>trip_mph</code> ndarray has been provided for you.</p>
<ol>
<li>Create a new ndarray, <code>cleaned_taxi</code>, containing only rows for which the values of <code>trip_mph</code> are less than 100.</li>
<li>Calculate the mean of the <code>trip_distance</code> column of <code>cleaned_taxi</code>. Assign the result to <code>mean_distance</code>.</li>
<li>Calculate the mean of the <code>trip_length</code> column of <code>cleaned_taxi</code>. Assign the result to <code>mean_length</code>.</li>
<li>Calculate the mean of the <code>total_amount</code> column of <code>cleaned_taxi</code>. Assign the result to <code>mean_total_amount</code>.</li>
</ol>

In [10]:
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)

cleaned_taxi = taxi[trip_mph < 100]
mean_distance = cleaned_taxi[:, 7].mean()
mean_length = cleaned_taxi[:, 8].mean()
mean_total_amount = cleaned_taxi[:, 13].mean()