# Data Manipulation with Pandas



![logo](logo_thumbnail.png)

*Data Science @ SC*

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/datascienceucsc/workshops/blob/master/f2020/data-manipulation/data_manipulation.ipynb)

## DrivenData Water Table: Problem Statement

Using data from Taarifa and the Tanzanian Ministry of Water, can you predict which pumps are functional, which need some repairs, and which don't work at all? This is an intermediate-level practice competition. Predict one of these three classes based on a number of variables about what kind of pump is operating, when it was installed, and how it is managed. A smart understanding of which waterpoints will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.

## Our Goal

In this notebook, we'll do an initial data exploration. You will learn about common statistical plots and what to look for when you first start a machine learning competition.

## Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
sns.set()

## Loading the data

Read in the data using `pandas`, then attach the labels to the features as a `status_group` column.

In [None]:
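X_train = pd.read_csv("train_features.csv")
y_train = pd.read_csv("train_labels.csv")
# Assignment aligns on the row index, so this assumes both files
# list the pumps in the same order
X_train["status_group"] = y_train["status_group"]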

## Feature types

View the feature descriptions on the [competition page](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/page/25/#features_list).

In [None]:
X_types = X_train.dtypes
X_types

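Make lists of the categorical and numerical variables. These will be useful for making visualizations. `region_code` and `district_code` are stored as integers but are really category codes, so they go in the categorical list.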

In [None]:
# pd.concat replaces Series.append, which was removed in pandas 2.0
X_cat = (pd.concat([
        X_types[X_types == "object"],
        pd.Series({"region_code": "int64", "district_code": "int64"}),
    ])
    .index
)

X_num = (X_types
    [(X_types == "int64") | (X_types == "float64")]
    .drop(["id", "region_code", "district_code"])
    .index
)

## Series and DataFrames
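
As a minimal sketch of how the two core objects relate: a `DataFrame` is a table of columns that share one row index, and each column is a `Series`. The `pumps` table below is toy data invented for illustration; only the column names are borrowed from the competition features.

```python
import pandas as pd

# Toy data, invented for illustration (not read from the competition files)
pumps = pd.DataFrame({
    "id": [101, 102, 103],
    "region": ["Iringa", "Mara", "Iringa"],
    "amount_tsh": [6000.0, 0.0, 25.0],
})

region = pumps["region"]          # one column name -> Series
subset = pumps[["id", "region"]]  # list of column names -> DataFrame

print(type(region).__name__)
print(type(subset).__name__)
# The Series keeps the row index of the DataFrame it came from
print(region.index.equals(pumps.index))
```
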

## Selecting data

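`loc`: label-based (selects rows and columns by their index and column names)

`iloc`: position-based (selects by integer position, like array indexing)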
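
The difference between the two indexers is easiest to see when the row index is not `0, 1, 2, ...`. The sketch below uses an invented three-row table whose index holds made-up pump ids; it is an illustration, not the competition data.

```python
import pandas as pd

# Toy data, invented for illustration; the index holds made-up pump ids
pumps = pd.DataFrame(
    {"region": ["Iringa", "Mara", "Manyara"], "gps_height": [1390, 1399, 686]},
    index=[572, 8776, 34310],
)

by_label = pumps.loc[8776, "region"]   # row labelled 8776, column "region"
by_position = pumps.iloc[1, 0]         # second row, first column

# Label slices include both endpoints; position slices exclude the stop
label_slice = pumps.loc[572:8776]      # two rows
position_slice = pumps.iloc[0:1]       # one row

# loc also accepts a boolean mask for the rows
high = pumps.loc[pumps["gps_height"] > 1000, ["region"]]

print(by_label, by_position)
print(high)
```
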

## Basic Exploration

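`head` and `tail`: show the first or last rows of a table (five by default)

`describe`: summary statistics for each column (numeric columns only by default)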
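
A short sketch of these three methods on an invented six-row table (the values are made up; on the real data you would call the same methods on `X_train`):

```python
import pandas as pd

# Toy data, invented for illustration
pumps = pd.DataFrame({
    "gps_height": [1390, 1399, 686, 263, 0, 0],
    "basin": ["Lake Nyasa", "Lake Victoria", "Pangani",
              "Lake Victoria", "Lake Victoria", "Pangani"],
})

first = pumps.head(2)   # first two rows
last = pumps.tail(2)    # last two rows

# On a DataFrame, describe() summarizes numeric columns by default
numeric_summary = pumps.describe()

# On a text column it reports count, unique, top, and freq instead
basin_summary = pumps["basin"].describe()

print(numeric_summary)
print(basin_summary)
```
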

## Value counts
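
`value_counts` tallies how often each distinct value appears in a column, which is a quick way to check class balance. A minimal sketch follows, using an invented `status_group` column whose labels mirror the three classes in the problem statement (the counts themselves are made up):

```python
import pandas as pd

# Toy labels, invented for illustration
status = pd.Series(
    ["functional", "non functional", "functional",
     "functional needs repair", "functional", "non functional"],
    name="status_group",
)

counts = status.value_counts()                # most frequent value first
shares = status.value_counts(normalize=True)  # fractions instead of counts

print(counts)
print(shares)
```
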

## Missing Values
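
One possible workflow, sketched on invented data: count the missing entries per column with `isna`, then either drop incomplete rows or fill the gaps. The fill values chosen here (the string `"unknown"` and the column median) are assumptions for illustration, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration
pumps = pd.DataFrame({
    "funder": ["Roman", None, "Unicef", None],
    "population": [109.0, 280.0, np.nan, 58.0],
})

missing_per_column = pumps.isna().sum()   # number of missing entries
missing_share = pumps.isna().mean()       # fraction of missing entries

complete_rows = pumps.dropna()            # keep rows with no missing values

# Fill each column with its own replacement value
filled = pumps.fillna({
    "funder": "unknown",
    "population": pumps["population"].median(),
})

print(missing_per_column)
print(filled)
```
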

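## Split-apply-combine

Done using `groupby` and an aggregation function: the rows are split into groups, the function is applied to each group, and the results are combined into one table.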
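
The pattern can be sketched on a small invented table: split the rows by `region`, apply an aggregation to each group, and let pandas combine the per-group results. The output column names `n_pumps` and `max_height` are hypothetical, chosen only for this example.

```python
import pandas as pd

# Toy data, invented for illustration
pumps = pd.DataFrame({
    "region": ["Iringa", "Iringa", "Mara", "Mara", "Mara"],
    "status_group": ["functional", "non functional",
                     "functional", "functional", "non functional"],
    "gps_height": [1400, 1200, 1300, 1500, 1400],
})

# One aggregation on one column
mean_height = pumps.groupby("region")["gps_height"].mean()

# Several named aggregations at once
summary = pumps.groupby("region").agg(
    n_pumps=("gps_height", "size"),
    max_height=("gps_height", "max"),
)

# Grouping by two columns gives a count per (region, status) pair
status_counts = pumps.groupby(["region", "status_group"]).size()

print(mean_height)
print(summary)
print(status_counts)
```
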

## Joining data
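
A minimal sketch of joining two tables on a shared key with `merge`. The `features` and `labels` tables are invented stand-ins for a features file and a labels file that share an `id` column; the ids deliberately do not match perfectly so that the effect of `how` is visible.

```python
import pandas as pd

# Toy tables, invented for illustration
features = pd.DataFrame({"id": [3, 1, 2], "gps_height": [686, 1390, 1399]})
labels = pd.DataFrame({
    "id": [1, 2, 4],
    "status_group": ["functional", "non functional", "functional"],
})

# Inner join: keep only ids present in both tables
inner = features.merge(labels, on="id", how="inner")

# Left join: keep every row of features; unmatched labels become missing.
# indicator=True adds a _merge column showing where each row came from.
left = features.merge(labels, on="id", how="left", indicator=True)

print(inner)
print(left)
```

Merging on the key, rather than assigning a column by position, keeps rows matched correctly even when the two tables are in different orders.
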