# Pandas Practice - Part 1!

<img src="panda.gif"> 

[gif source](https://25.media.tumblr.com/b268ecc1e1374a7b9386e98ebe623c30/tumblr_mjcpgxpPKL1rjriezo1_400.gif)


`pandas` is a Python package that is widely used for data cleaning and wrangling in Data Science (and beyond!). It's best used on tabular data: think a table with rows and columns. Typically each row is an observation and each column is an attribute of that observation.

In this example, we will look at data from police stops - where each row is 1 stop, and each column contains some description of that stop - like the time the stop took place, the outcome of the stop, etc.

We will explore a few common data cleaning practices using pandas functions like `.drop()`, `.rename()`, `.value_counts()`, and `.astype()` while using pandas documentation to guide us - *remember that using documentation is a critical data science skill* and part of the process!

The dataset here comes from [Kaggle](https://www.kaggle.com/datasets/melihkanbay/police) and is intended to be used to practice using pandas.

In [1]:
# If using Google Colab, uncomment the two `drive` lines below and run them.
#
# You'll also need the entire path of your data file in your Google Drive.
#     > You can use the 'files' tab on the left to find your data file, then right click and copy path to get the entire path
#     > Pass the path in "" inside the parameters when you read in the data.


# from google.colab import drive
# drive.mount('/content/drive/',  force_remount=True)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ali-rivera/Python-Support-Hours/blob/main/Week3/Week3_BlankResources.ipynb)

## Read in data
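
Here is a minimal sketch of reading the data with `pd.read_csv()`. So that it runs anywhere, it reads a few made-up rows from an in-memory string instead of the real file. The column names are assumed to match the Kaggle file, and the variable name `police` is just a choice for this example. With the real data you would pass your own file path in quotes instead of `sample_csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV: made-up rows, assumed column names.
# With the real file you would write something like pd.read_csv("your/path.csv").
sample_csv = io.StringIO(
    "stop_date,stop_time,driver_gender,driver_age_raw,driver_age,is_arrested\n"
    "2005-01-02,01:55,M,1985,20,False\n"
    "2005-01-18,08:15,M,1965,40,False\n"
    "2005-01-23,23:15,F,1972,33,True\n"
)

police = pd.read_csv(sample_csv)

# Quick sanity checks: the first rows and the (rows, columns) shape
print(police.head())
print(police.shape)
```

Looking at `.head()` and `.shape` right after reading is a cheap way to confirm the file loaded the way you expected.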

## Drop Redundant Columns

First, let's take a look and decide if there are any columns we just don't want. It looks like `driver_age_raw` may be the driver's birth year, and age has been added (or calculated) as its own column. Since this information is redundant, I think we can go ahead and drop `driver_age_raw`.

We can do this using the [.drop()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) function.
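
One way this could look, sketched on a tiny made-up DataFrame (every column name other than `driver_age_raw` is an assumption about the dataset):

```python
import pandas as pd

# Made-up sample rows standing in for the real data
police = pd.DataFrame({
    "driver_age_raw": [1985, 1965, 1972],
    "driver_age": [20, 40, 33],
    "driver_gender": ["M", "M", "F"],
})

# columns= names what to drop. .drop() returns a new DataFrame,
# so assign the result back to keep the change.
police = police.drop(columns=["driver_age_raw"])

print(police.columns.tolist())  # ['driver_age', 'driver_gender']
```

Assigning the result back (rather than using `inplace=True`) keeps it obvious which line changed the DataFrame.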

## Rename for easier indexing

Let's rename some of the columns to be a bit shorter while still being descriptive. We can use the [.rename()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html) function to do this.
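
A small sketch of renaming with a dictionary of old name to new name. The "old" names here are assumptions about how the raw file labels these columns, so check your own `police.columns` before copying the mapping.

```python
import pandas as pd

# Made-up sample rows; the original column names are assumed
police = pd.DataFrame({
    "driver_gender": ["M", "F"],
    "driver_age": [20, 33],
    "stop_outcome": ["Citation", "Warning"],
    "is_arrested": [False, True],
})

# Map old name -> new name; columns not listed keep their current names
police = police.rename(columns={
    "driver_gender": "gender",
    "driver_age": "age",
    "stop_outcome": "outcome",
    "is_arrested": "arrested",
})

print(police.columns.tolist())
```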

## Correct dtypes

Before we go any further, let's look at the data type of each column and make sure they are appropriate for the information stored there. There are a few ways to do this: [.info()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) gives us the most information, while the [.dtypes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html) attribute gives us just a list of the data types, if that's all we want.
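
As a rough illustration, here are both calls on a tiny made-up DataFrame with a couple of missing values. The `.isna().sum()` line at the end is an extra suggestion for counting missing values per column.

```python
import pandas as pd

# Made-up sample rows with some missing values (None)
police = pd.DataFrame({
    "stop_date": ["2005-01-02", "2005-01-18", "2005-01-23"],
    "age": [20.0, None, 33.0],
    "arrested": [False, None, True],
})

# .info() prints the index range plus each column's non-null count and dtype
police.info()

# .dtypes is an attribute (no parentheses): one dtype per column
print(police.dtypes)

# Number of missing values in each column
print(police.isna().sum())
```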

Notice the "RangeIndex" tells us how many rows we have (91741) and "Data Columns" tells us we have 14 columns (after dropping `driver_age_raw`). The "Non-Null Count" column in this output shows us that several columns have null values for several rows. We'll keep that in mind for later. For now, let's look at the type of each column.

`stop_date` and `stop_time` are stored as objects, but storing these as a datetime datatype may serve us better. Let's make a new column called `date_time` and store the date AND time in there using the [.to_datetime()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html#pandas.to_datetime) function.
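
One possible approach is to join the two text columns into a single string and hand that to `pd.to_datetime()`. The sample values and the `format` string below are assumptions about how the dates and times are written, so adjust them to match what you see in your data.

```python
import pandas as pd

# Made-up sample rows; the date and time formats are assumed
police = pd.DataFrame({
    "stop_date": ["2005-01-02", "2005-01-18"],
    "stop_time": ["01:55", "08:15"],
})

# Join date and time into one string per row, then parse it
combined = police["stop_date"] + " " + police["stop_time"]
police["date_time"] = pd.to_datetime(combined, format="%Y-%m-%d %H:%M")

print(police["date_time"].dtype)

# Once it's a datetime, the .dt accessor exposes parts like the hour
print(police["date_time"].dt.hour.tolist())
```

With a real datetime column, filtering by hour, month, or year becomes a one-liner through `.dt`.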

Now let's take a look at `gender`, `race`, `violation`, `outcome`, `search_type`, and `stop_duration`, which I suspect may be able to be stored as categories...

We can use the [.value_counts()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.value_counts.html) function for this, which lists out all the values in a column and the count of how many cells hold each value.
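
A short sketch of that check, followed by converting with `.astype("category")`. The rows are made up, and only two of the columns are shown to keep it brief.

```python
import pandas as pd

# Made-up sample rows
police = pd.DataFrame({
    "gender": ["M", "M", "F", None],
    "stop_duration": ["0-15 Min", "16-30 Min", "0-15 Min", "0-15 Min"],
})

# A handful of distinct values repeated many times suggests a category.
# dropna=False also counts the missing values.
print(police["gender"].value_counts(dropna=False))

# Convert each candidate column to the category dtype
for col in ["gender", "stop_duration"]:
    police[col] = police[col].astype("category")

print(police.dtypes)
```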

It looks like `arrested` is stored as an object but seems to be True/False. Let's take a look at the values and see if we can store it as a bool...
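
Here is one hedged way to do it. A plain `bool` column cannot hold missing values, so a missing entry would silently become `True` or `False`. This sketch uses pandas' nullable `"boolean"` dtype instead, which is a choice made for this example. The rows are made up.

```python
import pandas as pd

# Made-up sample rows, including one missing value
police = pd.DataFrame({"arrested": [False, True, None, False]})

# Confirm the only values are True, False, and missing
print(police["arrested"].value_counts(dropna=False))

# "boolean" is pandas' nullable boolean dtype: missing entries stay as <NA>
police["arrested"] = police["arrested"].astype("boolean")

print(police["arrested"].dtype)

# Summing a boolean column counts the True values (missing ones are skipped)
print(police["arrested"].sum())
```

If you would rather have an ordinary `bool` column, another option is to drop or fill the missing rows first and then call `.astype(bool)`.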

## Next Steps

Try some of the following (open ended) things on your own:
- Pick a demographic (a certain age range, a race, a gender) and a metric (stop_duration, time of day, outcome, etc). Try to filter on each of these things in 2 different ways.
- `age` is a numeric value. We may want to see the range of ages we have in our dataset. Find a pandas function to get summary statistics on the age variable.
    - Bonus: make a histogram of the age distribution!
- Make age categorical instead of numerical. Pick any number of categories you'd like and the cutoffs. Hint: this is called 'binning' and there is a function to do it! You could also do this with `time`...
    - suggestions: 3 categories - young, middle_aged, and old; by 10s: 10-19, 20-29, 30-39, etc.
- Pick a demographic and *anonymize* it. For example: replace the `gender` category with A/B instead of M/F.
