# CSV
CSV (comma-separated values) files contain row-level data. They can generally be viewed and edited in Excel or a text editor, but for large files it is usually more efficient to read and manipulate them in Python with an open source library such as `pandas`.

Data in these files is typically split by a 'token' or 'delimiter' such as a tab (`\t`), pipe (`|`) or comma (`,`). Variations that use a different delimiter are often given a `.psv` or `.tsv` extension instead of `.csv`. To handle these variations with `pandas` you only need to pass the `delimiter` parameter to `pandas.read_csv()`. The default delimiter is a comma, so normally you don't have to worry about this setting.
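
As a minimal sketch of the `delimiter` parameter, the snippet below reads pipe-separated and tab-separated text. The in-memory strings and their column names are made up for illustration and stand in for real `.psv` or `.tsv` files.

```python
import io

import pandas

# Hypothetical pipe-separated content standing in for a .psv file on disk.
psv_text = "name|year\nalpha|2006\nbeta|2008\n"

# `delimiter` tells read_csv which character separates the fields.
psv_data = pandas.read_csv(io.StringIO(psv_text), delimiter="|")

# Tab-separated content is handled the same way with delimiter="\t".
tsv_text = "name\tyear\nalpha\t2006\nbeta\t2008\n"
tsv_data = pandas.read_csv(io.StringIO(tsv_text), delimiter="\t")

print(psv_data)
print(tsv_data)
```

Both calls should produce the same two-column DataFrame, since only the separator differs.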

CSV files may or may not come with a header row. If they don't, you will need to set `header=None` when calling `pandas.read_csv()`.
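
The following sketch shows what `header=None` does, using a made-up headerless string rather than a real file. The `names` argument and the column names passed to it are an optional extra, chosen here only for illustration.

```python
import io

import pandas

# Hypothetical CSV content with no header row.
headerless_text = "1,Radial Velocity,2006\n2,Radial Velocity,2008\n"

# With header=None the first line is treated as data and the
# columns are labelled with integers 0, 1, 2, ...
no_header = pandas.read_csv(io.StringIO(headerless_text), header=None)

# Optionally supply your own column names at read time.
named = pandas.read_csv(
    io.StringIO(headerless_text),
    header=None,
    names=["rowid", "method", "year"],  # illustrative names
)

print(no_header)
print(named)
```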

In [1]:
import pandas

### Set the CSV URL (this can also be a path on the file system)

In [2]:
url = 'https://raw.githubusercontent.com/danielc92/csv-data/master/seaborn-data/planets.csv'

### Read the data in
Store the CSV data in a `pandas.DataFrame` object using the `pandas.read_csv()` function, which accepts either a file path or a URL.

In [3]:
data = pandas.read_csv(url)

### Inspect snippet of data
You can inspect the first 5 rows by calling the `.head()` method on a pandas DataFrame. You can also view the last rows using `.tail()`. Both methods accept a number of rows, e.g. `.head(30)`.

In [4]:
data.head()

| | rowid | pl_discmethod | pl_pnum | pl_orbper | pl_msinij | st_dist | pl_disc |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Radial Velocity | 1 | 269.3 | 7.1 | 77.4 | 2006 |
| 1 | 2 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
| 2 | 3 | Radial Velocity | 1 | 763.0 | 2.6 | 19.84 | 2011 |
| 3 | 4 | Radial Velocity | 1 | 326.03 | 19.4 | 110.62 | 2007 |
| 4 | 5 | Radial Velocity | 1 | 516.22 | 10.5 | 119.47 | 2009 |
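
To illustrate the row-count argument of `.head()` and `.tail()` without downloading anything, here is a small sketch on a made-up ten-row DataFrame. The column names are illustrative only.

```python
import pandas

# Small stand-in DataFrame with ten rows.
sample = pandas.DataFrame({
    "rowid": range(1, 11),
    "year": range(2001, 2011),
})

default_head = sample.head()   # first 5 rows by default
first_three = sample.head(3)   # first 3 rows
last_two = sample.tail(2)      # last 2 rows

print(first_three)
print(last_two)
```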


### Check data dimensions
The `shape` property returns a tuple with the number of rows followed by the number of columns. The row index is not counted as a column.

In [5]:
data.shape

(1035, 7)

### Check data headers

In [6]:
list(data.columns)

['rowid',
 'pl_discmethod',
 'pl_pnum',
 'pl_orbper',
 'pl_msinij',
 'st_dist',
 'pl_disc']

### Check null composition
You can use the `.info()` method to check the non-null count and data type of each column.

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1035 entries, 0 to 1034
Data columns (total 7 columns):
rowid            1035 non-null int64
pl_discmethod    1035 non-null object
pl_pnum          1035 non-null int64
pl_orbper        992 non-null float64
pl_msinij        513 non-null float64
st_dist          808 non-null float64
pl_disc          1035 non-null int64
dtypes: float64(3), int64(3), object(1)
memory usage: 56.7+ KB
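
If you want the null counts as numbers rather than a printed report, one option is to combine `.isnull()` with `.sum()`. The sketch below uses a tiny made-up DataFrame with illustrative column names.

```python
import pandas

# Stand-in data with some missing values (None becomes NaN).
sample = pandas.DataFrame({
    "orbper": [269.3, None, 763.0],
    "mass": [7.1, None, None],
    "year": [2006, 2008, 2011],
})

# Number of missing values in each column.
null_counts = sample.isnull().sum()

# Fraction of missing values in each column.
null_share = sample.isnull().mean()

print(null_counts)
print(null_share)
```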


### Statistical summary
You can build a quick statistical summary by calling `.describe()` on your `pandas.DataFrame` object.

In [8]:
data.describe()

| | rowid | pl_pnum | pl_orbper | pl_msinij | st_dist | pl_disc |
|---|---|---|---|---|---|---|
| count | 1035.0 | 1035.0 | 992.0 | 513.0 | 808.0 | 1035.0 |
| mean | 518.0 | 1.785507 | 2002.917596 | 2.638161 | 264.069282 | 2009.070531 |
| std | 298.923067 | 1.240976 | 26014.728304 | 3.818617 | 733.116493 | 3.972567 |
| min | 1.0 | 1.0 | 0.090706 | 0.0036 | 1.35 | 1989.0 |
| 25% | 259.5 | 1.0 | 5.44254 | 0.229 | 32.56 | 2007.0 |
| 50% | 518.0 | 1.0 | 39.9795 | 1.26 | 55.25 | 2010.0 |
| 75% | 776.5 | 2.0 | 526.005 | 3.04 | 178.5 | 2012.0 |
| max | 1035.0 | 7.0 | 730000.0 | 25.0 | 8500.0 | 2014.0 |
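
Note that the text column `pl_discmethod` is missing from the summary above, because `.describe()` only covers numeric columns by default. The sketch below, on a made-up DataFrame with illustrative column names, shows the default behaviour next to `include="all"`.

```python
import pandas

# Stand-in data with one text column and one numeric column.
sample = pandas.DataFrame({
    "method": ["Radial Velocity", "Radial Velocity", "Microlensing"],
    "year": [2006, 2008, 2010],
})

# Default: numeric columns only.
numeric_summary = sample.describe()

# include="all" also summarises non-numeric columns
# (count, unique, top, freq).
full_summary = sample.describe(include="all")

print(numeric_summary)
print(full_summary)
```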


### DataFrame composition
The DataFrame object has a row index and a column index (headers). These can be used to access cell-level values of the DataFrame, and they can also be used to aggregate and filter. You can use `.loc` to locate a particular cell by its row label and column label.

In [9]:
data.loc[2, 'pl_disc']

2011
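
Beyond a single cell, `.loc` can also return a whole row or a set of rows and columns. The following is a small sketch on made-up data; the column names are illustrative.

```python
import pandas

# Stand-in data with the default integer row index (0, 1, 2).
sample = pandas.DataFrame({
    "method": ["Radial Velocity", "Microlensing", "Pulsar Timing"],
    "year": [2006, 2008, 2011],
})

# A single cell: row label 2, column label "year".
cell = sample.loc[2, "year"]

# A whole row, returned as a pandas.Series keyed by column name.
row = sample.loc[1]

# Several rows and columns at once, returned as a DataFrame.
block = sample.loc[[0, 2], ["method", "year"]]

print(cell)
print(row)
print(block)
```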

### Slicing horizontally
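To slice a `pandas.DataFrame` object horizontally, specify the lower and upper limits as row positions. The lower limit is included and the upper limit is excluded, so `data[2:7]` returns the rows at positions 2 through 6.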

In [15]:
data[2:7]

| | rowid | pl_discmethod | pl_pnum | pl_orbper | pl_msinij | st_dist | pl_disc |
|---|---|---|---|---|---|---|---|
| 2 | 3 | Radial Velocity | 1 | 763.0 | 2.6 | 19.84 | 2011 |
| 3 | 4 | Radial Velocity | 1 | 326.03 | 19.4 | 110.62 | 2007 |
| 4 | 5 | Radial Velocity | 1 | 516.22 | 10.5 | 119.47 | 2009 |
| 5 | 6 | Radial Velocity | 1 | 185.84 | 4.8 | 76.39 | 2008 |
| 6 | 7 | Radial Velocity | 1 | 1773.4 | 4.64 | 18.15 | 2002 |
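
One thing worth knowing is that plain `[2:7]` slicing is positional, while `.loc[2:7]` slices by label and includes the upper limit. A short sketch on a made-up ten-row DataFrame:

```python
import pandas

# Stand-in DataFrame with the default row index 0..9.
sample = pandas.DataFrame({"rowid": range(1, 11)})

# Positional slice: positions 2, 3, 4, 5, 6 (upper limit excluded).
by_position = sample[2:7]

# .iloc makes the positional intent explicit and gives the same rows.
by_iloc = sample.iloc[2:7]

# .loc slices by label and includes both ends: labels 2 through 7.
by_label = sample.loc[2:7]

print(len(by_position), len(by_label))
```

With the default integer index the two styles differ only in whether the upper limit is included, but they can diverge further once the index is reordered or non-numeric.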


### Slicing vertically
Selecting a single column returns a `pandas.Series`. It can be iterated through like a Python list, and you can convert it to a Python list using `list(data['pl_disc'])`. To select multiple columns, pass in a list containing the headers of interest; this returns a `pandas.DataFrame`.

In [13]:
data['pl_disc']

0       2006
1       2008
2       2011
3       2007
4       2009
5       2008
6       2002
7       1996
8       2008
9       2010
10      2010
11      2009
12      2008
13      1996
14      2001
15      2009
16      1995
17      1996
18      2004
19      2002
20      2011
21      2007
22      2009
23      2009
24      2009
25      1996
26      2012
27      2008
28      2013
29      2005
        ... 
1005    2012
1006    2012
1007    2012
1008    2012
1009    2012
1010    2012
1011    2012
1012    2012
1013    2012
1014    2012
1015    2012
1016    2013
1017    2012
1018    2012
1019    2012
1020    2012
1021    2013
1022    2012
1023    2012
1024    2012
1025    2012
1026    2014
1027    2011
1028    2012
1029    2012
1030    2006
1031    2007
1032    2007
1033    2008
1034    2008
Name: pl_disc, Length: 1035, dtype: int64

In [11]:
data[['pl_disc', 'st_dist']]

| | pl_disc | st_dist |
|---|---|---|
| 0 | 2006 | 77.40 |
| 1 | 2008 | 56.95 |
| 2 | 2011 | 19.84 |
| 3 | 2007 | 110.62 |
| 4 | 2009 | 119.47 |
| 5 | 2008 | 76.39 |
| 6 | 2002 | 18.15 |
| 7 | 1996 | 21.41 |
| 8 | 2008 | 73.10 |
| 9 | 2010 | 74.79 |
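
The difference between selecting one column and selecting a list of columns can be sketched as follows. The data is a made-up three-row sample that reuses a few column names from the dataset above.

```python
import pandas

# Stand-in data reusing three column names from the planets dataset.
sample = pandas.DataFrame({
    "pl_disc": [2006, 2008, 2011],
    "st_dist": [77.4, 56.95, 19.84],
    "pl_pnum": [1, 1, 1],
})

# A single header returns a pandas.Series.
years = sample["pl_disc"]

# A Series can be iterated like a list...
for year in years:
    print(year)

# ...or converted to a plain Python list.
years_list = years.tolist()

# A list of headers returns a pandas.DataFrame with those columns.
subset = sample[["pl_disc", "st_dist"]]
print(subset)
```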


### Filtering
Filtering a `pandas.DataFrame` object is pretty straightforward. You can filter by a single condition, or by multiple conditions combined with `|` (**or**) or `&` (**and**). Wrap each condition in parentheses, as in the examples below.

In [28]:
data[(data['st_dist'] >= 3000) | (data['pl_discmethod'] == 'Pulsar Timing')]

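| | rowid | pl_discmethod | pl_pnum | pl_orbper | pl_msinij | st_dist | pl_disc |
|---|---|---|---|---|---|---|---|
| 905 | 906 | Microlensing | 1 | NaN | NaN | 3600.0 | 2013 |
| 911 | 912 | Microlensing | 1 | NaN | NaN | 7720.0 | 2012 |
| 912 | 913 | Microlensing | 1 | NaN | NaN | 7560.0 | 2013 |
| 925 | 926 | Microlensing | 2 | NaN | NaN | 4080.0 | 2012 |
| 926 | 927 | Microlensing | 2 | NaN | NaN | 4080.0 | 2012 |
| 928 | 929 | Microlensing | 1 | NaN | NaN | 4970.0 | 2013 |
| 941 | 942 | Pulsar Timing | 3 | 25.262 | NaN | NaN | 1992 |
| 942 | 943 | Pulsar Timing | 3 | 66.5419 | NaN | NaN | 1992 |
| 943 | 944 | Pulsar Timing | 3 | 98.2114 | NaN | NaN | 1994 |
| 944 | 945 | Pulsar Timing | 1 | 36525.0 | NaN | NaN | 2003 |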


In [31]:
data[(data['st_dist'] >= 1200) & (data['pl_discmethod'] == 'Pulsar Timing')]

| | rowid | pl_discmethod | pl_pnum | pl_orbper | pl_msinij | st_dist | pl_disc |
|---|---|---|---|---|---|---|---|
| 945 | 946 | Pulsar Timing | 1 | 0.090706 | NaN | 1200.0 | 2011 |
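
The filtering patterns used in this section can be sketched on a small made-up DataFrame, which makes it easier to see which rows each condition keeps. The column names and values are illustrative, and rows with a missing distance never satisfy a `>=` comparison.

```python
import pandas

# Stand-in data; None becomes NaN in the numeric column.
sample = pandas.DataFrame({
    "method": ["Radial Velocity", "Microlensing",
               "Pulsar Timing", "Pulsar Timing"],
    "dist": [77.4, 3600.0, None, 1200.0],
})

# Single condition: a boolean Series selects the matching rows.
far = sample[sample["dist"] >= 1000]

# OR: keep rows where either condition holds.
either = sample[(sample["dist"] >= 3000) | (sample["method"] == "Pulsar Timing")]

# AND: keep rows where both conditions hold.
both = sample[(sample["dist"] >= 1000) & (sample["method"] == "Pulsar Timing")]

print(far)
print(either)
print(both)
```

The parentheses matter because `|` and `&` bind more tightly than comparison operators such as `>=` and `==` in Python.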
