# Introduction

In the last tutorial, we learned how to select relevant data out of a DataFrame or Series. Plucking the right data out of our data representation is critical to getting work done, as we demonstrated in the exercises.

However, the data does not always come out of memory in the format we want right off the bat. Sometimes we have to do some more work ourselves to reformat it for the task at hand. This tutorial will cover different operations we can apply to our data to get the input "just right".

In [1]:

import pandas as pd
pd.set_option('display.max_rows', 5)  # full option name; the bare 'max_rows' is ambiguous in newer pandas
import numpy as np
df = pd.read_csv("../Datasets/PakistanDroneAttacksWithTemp Ver 9 (October 19, 2017).csv", encoding='cp1252')

In [2]:
df

Unnamed: 0,S#,Date,Time,Location,City,Province,No of Strike,Al-Qaeda,Taliban,Civilians Min,...,Injured Min,Injured Max,Women/Children,Special Mention (Site),Comments,References,Longitude,Latitude,Temperature(C),Temperature(F)
0,1.0,"Friday, June 18, 2004",22:00,Near Wana,south Waziristan,FATA,1.0,,1.0,0.0,...,,,N,Blast occured in courtyard of the house of lon...,Village in Wana,http://archives.dawn.com/2004/06/19/top1.htm,69.9000,33.0333,28.475,83.255
1,2.0,"Sunday, May 08, 2005",23:30,Mir Ali (Near Afghan Border),North Waziristan,FATA,1.0,1.0,,0.0,...,,,N,Drone struck a car driven by local warlord- ki...,Civilian killied was Samiullah Khan who was a ...,http://www.msnbc.msn.com/id/7847008/,70.1455,32.9746,11.475,52.655
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
403,405.0,"Monday, October 16, 2017",,Zero-point,Lower Kurram Agency,FATA,4.0,,5.0,,...,,,N,At least five suspected militants were killed ...,Conflict of Report: Foreign media reported tha...,http://www.thesundaily.my/news/2017/10/18/deat...,,,,
404,,,,,,,,49.0,662.0,1304.0,...,402.0,1329.0,,,,,,,,


# Summary functions

Pandas provides many simple "summary functions" (not an official name) which restructure the data in some useful way. For example, consider the `describe()` method:

In [3]:
df.Time.describe()

count       173
unique       76
top       10:00
freq         10
Name: Time, dtype: object

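This method generates a high-level summary of the attributes of the given column. It is type-aware, meaning that its output changes based on the data type of the input. The output above is the form it takes for string data such as `Time`, which is stored with dtype `object`; for numerical data it instead reports statistics such as the mean, standard deviation, and quartiles. Another string column, `Location`, produces the same kind of summary: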

In [4]:
df.Location.describe()

count            402
unique           302
top       Datta Khel
freq              14
Name: Location, dtype: object
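To make the type-aware behavior concrete, here is a minimal sketch that calls `describe()` on one numeric and one string Series. The names `temps` and `places` and their values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data, not rows from the drone-strike dataset.
temps = pd.Series([28.5, 11.5, 7.0, None], name="temp_c")
places = pd.Series(["Wana", "Mir Ali", "Wana", None], name="place")

# Numeric input: count, mean, std, min, 25%, 50%, 75%, max.
numeric_summary = temps.describe()

# String input: count, unique, top, freq.
string_summary = places.describe()

print(numeric_summary)
print(string_summary)
```

In both cases `count` only counts the non-missing entries, which is why it can be smaller than the number of rows.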

If you want to get some particular simple summary statistic about a column in a DataFrame or a Series, there is usually a helpful pandas function that makes it happen. 

For example, to see the mean, we can use the `mean()` method:

In [5]:
df['Temperature(C)'].mean()

16.01928927680796
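Columns like this one contain missing values, so it helps to know how `mean()` treats them. The sketch below uses a small made-up Series (`temps` is a hypothetical name) to show that missing values are skipped by default, and that `skipna=False` changes this.

```python
import pandas as pd

# Hypothetical toy data with one missing value.
temps = pd.Series([10.0, 20.0, None, 30.0])

avg = temps.mean()                     # NaN is skipped: (10 + 20 + 30) / 3
avg_strict = temps.mean(skipna=False)  # NaN propagates into the result
mid = temps.median()                   # median also skips NaN by default

print(avg, avg_strict, mid)
```

The default is usually what we want for a quick summary, but it means the mean is computed over fewer rows than the DataFrame has.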

To see a list of unique values, we can use the `unique()` method:

In [6]:
df['Temperature(C)'].unique()

array([ 28.475,  11.475,   7.08 ,   0.535,  10.025,  18.12 ,  25.77 ,
        24.395,  15.325,  13.79 ,   2.18 ,  15.575,  18.32 ,  16.21 ,
        27.24 ,  27.975,  27.47 ,  26.855,  25.555,  25.235,  24.175,
        24.255,  24.8  ,  22.645,  20.55 ,  25.485,  22.615,  24.64 ,
        24.585,  19.255,  22.595,  15.875,  16.47 ,  15.24 ,  17.5  ,
        16.685,   8.815,  16.985,   8.75 ,  13.46 ,  12.465,   7.   ,
         7.655,   5.69 ,   6.505,   9.365,   3.055,  20.   ,  14.21 ,
        14.8  ,  18.315,  12.34 ,  13.   ,  17.275,  15.57 ,  11.065,
        20.995,  23.6  ,  18.865,  22.4  ,  22.33 ,  18.075,  17.425,
        21.67 ,  20.64 ,  25.65 ,  26.815,  23.24 ,  26.62 ,  26.395,
        25.44 ,  23.805,  24.155,  21.295,  20.8  ,  22.705,  24.19 ,
        22.16 ,  22.43 ,  14.825,  11.66 ,  19.145,  14.735,   5.025,
         8.23 ,   3.95 ,   3.005,   3.53 ,   2.82 ,   3.46 ,   5.87 ,
         4.855,   6.26 ,   4.175,   3.58 ,   3.43 ,   5.63 ,   5.57 ,
         6.12 ,   1.

In [7]:
df['Time'].unique()

array(['22:00', '23:30', nan, '3:00', '10:30', '4:00', '2:00', '15:35',
       '20:00', '20:30', '19:00', '16:30', '15:00', '17:00', '9:00',
       '10:20', '5:30', '16:00', '11:30', '10:00', '1:45', '4:30', '7:45',
       '22:30', '19:45', '20:15', '8:00', '6:30', '15:50', '23:45',
       '6:05', '23:00', '19:30', '3:40', '1:30', '6:20', '3:50', '15:30',
       '8:50', '11:00', '18:30', '17:30', '3:30', '14:00', '21:20',
       '12:45', '21:00', '21:30', '6:00', '12:00', '7:30', '9:30', '7:00',
       '6:15', '17:40', '17:35', '14:45', '17:45', '23:15', '5:45',
       '18:00', '22:45', '6:45', '7:15', '8:30', '3:15', '20:45', '13:30',
       '16:40', '12:15', '2:30', '1:00', '5:00', '4:45', '17:50', '12:30',
       '15:45'], dtype=object)
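Notice that the array above contains `nan`: `unique()` treats a missing value as one more distinct entry. As a minimal sketch on made-up data (`times` is a hypothetical Series), we can compare it with `nunique()`, which ignores missing values by default, and with dropping them first.

```python
import pandas as pd

# Hypothetical toy data with one missing value and one repeat.
times = pd.Series(["22:00", "23:30", None, "22:00"])

distinct = times.unique()                  # keeps the missing value as an entry
n_distinct = times.nunique()               # counts distinct non-missing values
distinct_clean = times.dropna().unique()   # drop missing values before deduplicating

print(len(distinct), n_distinct, list(distinct_clean))
```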

To see a list of unique values _and_ how often they occur in the dataset, we can use the `value_counts()` method:

In [8]:
df.Time.value_counts()

10:00    10
12:00     7
         ..
17:40     1
15:45     1
Name: Time, Length: 76, dtype: int64
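By default `value_counts()` leaves missing values out of the tally. The sketch below, again on a made-up Series (`times` is a hypothetical name), shows two optional parameters: `dropna=False` to count missing values as well, and `normalize=True` to get proportions instead of counts.

```python
import pandas as pd

# Hypothetical toy data: one value repeated, one missing.
times = pd.Series(["10:00", "12:00", "10:00", None, "10:00"])

counts = times.value_counts()                          # missing value excluded
counts_with_missing = times.value_counts(dropna=False) # missing value gets its own row
shares = times.value_counts(normalize=True)            # fractions of non-missing rows

print(counts)
print(counts_with_missing)
print(shares)
```

The results are sorted from most to least frequent, so the first row is the same value that `describe()` reports as `top`.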