## ☑️ Part 1: Importing, examining, and updating values in a dataset

- Complete the following questions
- Make sure you run the following code cells before you attempt any of the questions
- First, import the pandas library, which is used for analysing and manipulating data

In [6]:
import pandas as pd

Now use the `pd.read_csv()` function to import the file `libraries.csv` from the `data` folder, assigning the result to `df`:

In [7]:
df = pd.read_csv('data/libraries.csv')
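
As a minimal sketch of what `pd.read_csv()` returns, the example below parses a small CSV from an in-memory string rather than from `data/libraries.csv`. The CSV text and the name `toy` are made up for illustration and are not part of the exercise data.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk (illustration only)
csv_text = "Library name,Weekly hours open\nAlpha,40\nBeta,35\n"

# read_csv accepts a file path or any file-like object, such as StringIO
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)          # (rows, columns)
print(list(toy.columns))  # column names taken from the header row
```

The first line of the CSV becomes the column names, and each remaining line becomes one row of the DataFrame.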

### Exploratory data analysis

**Q1)** Check that the file has been imported as expected by using the `.head()` method to look at the first `3` rows of the `df` DataFrame:

In [8]:
#add your code below

#hide
df.head(3)
#/hide

**Q2)** Use the `.info()` method on `df` to look at the structure of the DataFrame:

In [9]:
#add your code below

#hide
df.info()
#/hide

**Q3)** Use the `.describe()` method on `df` to look at summary statistics for the numerical data columns:

In [13]:
#add your code below

#hide
df.describe()
#/hide

Notice that the `min` value in both `Weekly hours open` and `Weekly hours staffed` is negative. Opening hours cannot be negative, which suggests an error in the data.

**Q4)** You're told that there is an error in the row with index `2233`. Use `.loc[]` to look at this row and investigate the error further:

In [14]:
#add your code below

#hide
df.loc[2233]
#/hide

**Q5)** Use `.loc[]` to update the values of both `Weekly hours open` and `Weekly hours staffed` in row `2233` to `57`:

In [12]:
#add your code below

#hide
df.loc[2233, ['Weekly hours open', 'Weekly hours staffed']] = 57
#/hide

Once updated, re-run your code for `Q3` and `Q4` to check that both values in the row at index `2233` have changed.

The `min` value for both `Weekly hours open` and `Weekly hours staffed` should now be `0`.
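
The following sketch shows the same pattern on a tiny, made-up DataFrame: select a row by label with `.loc[]`, overwrite two columns in that row, then confirm that the column minimums are no longer negative. The column names, index labels, and values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical data: the row labelled 11 holds an obviously invalid value
toy = pd.DataFrame(
    {"hours_open": [40, -1, 35], "hours_staffed": [38, -1, 30]},
    index=[10, 11, 12],
)

# .loc[row_label] returns that row as a Series
print(toy.loc[11])

# .loc[row_label, [column_labels]] = value overwrites those cells in place
toy.loc[11, ["hours_open", "hours_staffed"]] = 57

# The minimum of each column no longer includes the invalid value
print(toy.min())
```

Because `.loc[]` works on labels, `11` here refers to the index label, not the row position.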

## ☑️ Part 2: Conditional filtering and modifying DataFrames
- Complete the following questions
- Make sure you re-run all the above code cells before you attempt any of the following questions

**Q6)** Use the `.set_index()` method with the `inplace=True` parameter to set the values from the `Library name` column as the index of `df`:

In [15]:
#add your code below

#hide
df.set_index('Library name', inplace=True)
df.head()
#/hide

**Q7)** Use the `.drop()` method with the `axis=1` and `inplace=True` parameters to remove the `Notes` column from `df`:

In [16]:
#add your code below

#hide
df.drop('Notes', axis=1, inplace=True)
df.head()
#/hide
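
For comparison, here is a hedged sketch of the same two steps without `inplace=True`. Both `.set_index()` and `.drop()` return a new DataFrame by default, so they can be chained and the result assigned to a name. The toy data and the names `toy` and `tidy` are made up for illustration.

```python
import pandas as pd

# Hypothetical data with a name column and a notes column
toy = pd.DataFrame(
    {
        "name": ["Alpha", "Beta"],
        "hours": [40, 35],
        "notes": ["", "closed Mondays"],
    }
)

# Without inplace=True each method returns a new DataFrame,
# so the original `toy` is left unchanged
tidy = toy.set_index("name").drop(columns="notes")

print(tidy)
print(list(toy.columns))  # the original still has all three columns
```

`drop(columns="notes")` is equivalent to `drop("notes", axis=1)`. Which form to use is a matter of preference.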

### Some additional data preparation is needed before we continue with the rest of the questions. You will learn these techniques in more detail in the next module.

Let's use the `.value_counts()` method on `df['In use 2010']` and `df['In use 2016']` to see which values appear in those columns:

In [17]:
df['In use 2010'].value_counts()

In [18]:
df['In use 2016'].value_counts()

In the cell below, we have defined a function called **`is_open()`**, which takes a single value and returns a boolean value as follows:
- `True` if the value equals `'yes'` or `'Yes'`
- `False` for any other value (`'no'` or `'No'`)

Run the cell to make the function available. You do not need to modify it for this exercise!

In [19]:
def is_open(entry):
    # True only for 'yes' or 'Yes'; any other value gives False
    return entry in ['yes', 'Yes']

The following cell contains some examples that allow us to test the logic in our function:

In [20]:
is_open('No'), is_open('yes')

Let's use the `.apply()` method with the `is_open()` function and each of the columns `In use 2010` and `In use 2016`, to create new columns called `open_2010` and `open_2016` respectively, each containing Boolean values returned by the function:

In [21]:
df['open_2010'] = df['In use 2010'].apply(is_open)
df['open_2016'] = df['In use 2016'].apply(is_open)

df.head()
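
As an aside, the element-by-element `.apply()` call can also be expressed with the vectorised `.isin()` method. The sketch below compares the two on a small, made-up Series. The values and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical column with the same kinds of values as 'In use 2010'
in_use = pd.Series(["Yes", "no", "yes", "No"])

def is_open(entry):
    # Same logic as the function defined above
    return entry in ["yes", "Yes"]

# Element by element: the function is called once per value
via_apply = in_use.apply(is_open)

# Vectorised: membership is tested for the whole Series in one call
via_isin = in_use.isin(["yes", "Yes"])

print(via_apply.tolist())
print(via_isin.tolist())
```

Both approaches should give the same Boolean values. `.apply()` is more flexible when the logic is harder to express with built-in methods.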

**Q8)** Create a new DataFrame called `df_open` that contains only the rows where both `open_2010` and `open_2016` are `True`:

- Hint: You can start by creating a Boolean mask

In [22]:
#add your code below

#hide

# Both columns already hold Booleans, so they can be combined directly with &
mask1 = df['open_2010'] & df['open_2016']

df_open = df[mask1]

df_open.head()

#/hide
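
A minimal sketch of Boolean masking on made-up data follows. Two Boolean columns are combined with `&` (element-wise "and"), and the resulting mask keeps only the rows where it is `True`. The index labels and values are hypothetical.

```python
import pandas as pd

# Hypothetical libraries with open/closed flags for two years
toy = pd.DataFrame(
    {
        "open_2010": [True, True, False],
        "open_2016": [True, False, False],
        "hours": [40, 0, 0],
    },
    index=["Alpha", "Beta", "Gamma"],
)

# Element-wise AND: True only where both columns are True
mask = toy["open_2010"] & toy["open_2016"]

# Indexing with a Boolean Series keeps the rows where the mask is True
both_open = toy[mask]

print(mask.tolist())
print(both_open)
```

When a mask is built from comparisons such as `toy["hours"] > 0`, each comparison needs its own parentheses before it is combined with `&`, because `&` binds more tightly than comparison operators in Python.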

**Q9)** Calculate the `.mean()` of the values in the `Weekly hours open` column of the `df_open` DataFrame:

In [23]:
#add your code below

#hide
df_open['Weekly hours open'].mean()
#/hide

**Q10)** Calculate the percentage of entries in `df_open` where `['Weekly hours open'] == 0`:

- Hint: You can make use of the `len()` function

In [24]:
#add your code below

#hide
len(df_open[df_open['Weekly hours open'] == 0]) / len(df_open) * 100
#/hide
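
The sketch below works through the percentage calculation on a made-up Series. It also shows an alternative that takes the mean of the Boolean mask, which works because `True` counts as `1` and `False` as `0`. The values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical weekly opening hours: two of the five entries are zero
hours = pd.Series([0, 40, 35, 0, 50])

# Count the matching rows and divide by the total number of rows
pct_len = len(hours[hours == 0]) / len(hours) * 100

# Alternative: the mean of a Boolean mask is the fraction of True values
pct_mean = (hours == 0).mean() * 100

print(pct_len, pct_mean)
```

With two zeros out of five entries, both expressions should evaluate to 40 percent.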

# Additional notes

## Exploring the impact of `NaN` values

Converting the zero values to `NaN` values will result in those values being excluded from subsequent calculations such as `.mean()`.

These `NaN` values ('Not a Number') represent 'missing data' (of all types, not just numeric) in pandas.

Let's import the NumPy library with the **alias** `np`. We mainly need it for the `np.nan` marker, which is used to represent `NaN` values.

In [None]:
import numpy as np

In [None]:
np.nan

**Now let's use `.replace()` to modify the `Weekly hours open` column:**

In [None]:
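# df_open was created by filtering df, so take an explicit copy first
# to avoid a SettingWithCopyWarning when assigning to a column
df_open = df_open.copy()
df_open['Weekly hours open'] = df_open['Weekly hours open'].replace(0, np.nan)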

**Let's use the same code as before to calculate the `.mean()` of `Weekly hours open` to see if the value is now different.**

In [None]:
df_open['Weekly hours open'].mean()
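
To make the effect concrete, here is a small sketch on made-up values. It compares the mean of a Series that contains zeros with the mean after those zeros are replaced by `np.nan`. The numbers are hypothetical and are not taken from the libraries dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly opening hours, including two zero entries
hours = pd.Series([0, 40, 50, 0])

# Zeros are ordinary values, so they pull the mean down: (0+40+50+0)/4
mean_with_zeros = hours.mean()

# NaN values are skipped by .mean(), so only 40 and 50 are averaged
hours_nan = hours.replace(0, np.nan)
mean_without_zeros = hours_nan.mean()

print(mean_with_zeros)       # 22.5
print(mean_without_zeros)    # 45.0
print(hours_nan.count())     # number of non-missing values: 2
```

Whether zeros should be treated as missing depends on what they mean in the data. A library that is genuinely open for zero hours is different from one whose hours were never recorded.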