## DataFrame Manipulation in Python

Fundamental pandas operations in action.

The following tasks are performed:

* Import the necessary libraries.
* Read in the abalone data (originally from the UCI Machine Learning Repository).
* Check the type.
* Get the size of the dataframe in memory.
* Get the shape of the dataframe.  
* Look at the first few data instances.
* Add column names.
* Do a statistical summary.
* Remove the rows whose height is at or below the median.
* Get the size of the dataframe in memory again.
* Remove the length and diameter columns.

#### Import the necessary libraries.

In [1]:
import sys
import pandas as pd
import numpy as np

#### Read in the abalone data (originally from the UCI Machine Learning Repository).

In [2]:
url = "https://raw.githubusercontent.com/basilhan/datasets/master/abalone.data"
data = pd.read_csv(url, header=None)

#### Check the type.

In [3]:
type(data)

pandas.core.frame.DataFrame

#### Get the size of the dataframe in memory.

In [4]:
print(sys.getsizeof(data), "bytes")

526406 bytes
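
Note that `sys.getsizeof` reports an approximate memory footprint in bytes, which is not the same thing as `DataFrame.size` (the number of cells). As a minimal sketch on a small made-up frame (the `toy` name and its values are illustrative, not the abalone data), `memory_usage(deep=True)` gives a per-column breakdown of that footprint:

```python
import pandas as pd

# Toy frame for illustration only; not the abalone data.
toy = pd.DataFrame({"Sex": ["M", "F", "I"], "Height": [0.095, 0.135, 0.080]})

n_cells = toy.size                        # rows * columns
per_column = toy.memory_usage(deep=True)  # bytes per column, plus the index
total_bytes = int(per_column.sum())

print(n_cells)
print(per_column)
print(total_bytes, "bytes")
```

The exact byte counts depend on the pandas version and platform, so they are best read as rough estimates.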


#### Get the shape of the dataframe.

In [5]:
data.shape

(4177, 9)

#### Look at the first few data instances.

In [6]:
data.head()

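|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |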


#### Add column names.

In [7]:
names = ["Sex", "Length", "Diam", "Height", "Whole", "Shucked", "Viscera", "Shell", "Rings"]
data.columns = names
data.head()

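|  | Sex | Length | Diam | Height | Whole | Shucked | Viscera | Shell | Rings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |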
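
As an alternative to assigning `data.columns` afterwards, the names can be passed to `pd.read_csv` through its `names` parameter. The sketch below assumes the same nine column names and inlines the first three rows of the file in a `StringIO` buffer (the `sample` and `named` variables are illustrative) so that it runs without network access:

```python
import io
import pandas as pd

# First three rows of the file, inlined so the sketch runs offline.
sample = io.StringIO(
    "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
    "M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7\n"
    "F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9\n"
)
names = ["Sex", "Length", "Diam", "Height", "Whole",
         "Shucked", "Viscera", "Shell", "Rings"]

# header=None says the file has no header row; names supplies the labels.
named = pd.read_csv(sample, header=None, names=names)
print(named.head())
```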


#### Do a statistical summary.

In [8]:
data.describe()

|  | Length | Diam | Height | Whole | Shucked | Viscera | Shell | Rings |
|---|---|---|---|---|---|---|---|---|
| count | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 | 4177.0 |
| mean | 0.523992 | 0.407881 | 0.139516 | 0.828742 | 0.359367 | 0.180594 | 0.238831 | 9.933684 |
| std | 0.120093 | 0.09924 | 0.041827 | 0.490389 | 0.221963 | 0.109614 | 0.139203 | 3.224169 |
| min | 0.075 | 0.055 | 0.0 | 0.002 | 0.001 | 0.0005 | 0.0015 | 1.0 |
| 25% | 0.45 | 0.35 | 0.115 | 0.4415 | 0.186 | 0.0935 | 0.13 | 8.0 |
| 50% | 0.545 | 0.425 | 0.14 | 0.7995 | 0.336 | 0.171 | 0.234 | 9.0 |
| 75% | 0.615 | 0.48 | 0.165 | 1.153 | 0.502 | 0.253 | 0.329 | 11.0 |
| max | 0.815 | 0.65 | 1.13 | 2.8255 | 1.488 | 0.76 | 1.005 | 29.0 |
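
By default `describe()` summarises only the numeric columns, which is why `Sex` does not appear in the summary. One way to inspect a categorical column is sketched below on a small made-up frame (the `toy` name and its values are illustrative, not the abalone data):

```python
import pandas as pd

# Toy frame for illustration only; not the abalone data.
toy = pd.DataFrame({"Sex": ["M", "M", "F", "I"], "Rings": [15, 7, 9, 7]})

numeric_summary = toy.describe()        # numeric columns only
sex_counts = toy["Sex"].value_counts()  # frequency of each category
sex_summary = toy["Sex"].describe()     # count, unique, top, freq

print(numeric_summary.columns.tolist())
print(sex_counts)
print(sex_summary)
```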


#### Remove the rows whose height is at or below the median.

In [9]:
data = data[data.Height > data.Height.median()]
data.shape

(2072, 9)
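
The result has 2072 rows rather than half of 4177 because the strict `>` comparison also drops every row whose height equals the median. A minimal sketch with made-up heights (the `toy` frame is illustrative, not the abalone data) shows how ties at the median change the row count:

```python
import pandas as pd

# Toy heights with three ties at the median; not the abalone data.
toy = pd.DataFrame({"Height": [0.10, 0.14, 0.14, 0.14, 0.20]})
median = toy["Height"].median()

strictly_above = toy[toy["Height"] > median]   # drops the tied rows
at_or_above = toy[toy["Height"] >= median]     # keeps the tied rows

print(median, len(strictly_above), len(at_or_above))
```

Whether `>` or `>=` is the better choice depends on whether rows at the median should count as part of the retained group.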

#### Get the size of the dataframe in memory again.

In [10]:
print(sys.getsizeof(data), "bytes")

277672 bytes


#### Remove the length and diameter columns.

In [11]:
data = data.drop(columns=["Length", "Diam"])  # columns= drops columns; the positional axis argument no longer works in pandas 2.x
data.head()

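|  | Sex | Height | Whole | Shucked | Viscera | Shell | Rings |
|---|---|---|---|---|---|---|---|
| 6 | F | 0.15 | 0.7775 | 0.237 | 0.1415 | 0.33 | 20 |
| 9 | F | 0.15 | 0.8945 | 0.3145 | 0.151 | 0.32 | 19 |
| 13 | F | 0.145 | 0.6845 | 0.2725 | 0.171 | 0.205 | 10 |
| 22 | F | 0.155 | 0.9395 | 0.4275 | 0.214 | 0.27 | 12 |
| 24 | F | 0.165 | 1.1615 | 0.513 | 0.301 | 0.305 | 10 |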
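
The row labels 6, 9, 13, 22 and 24 in this output are the original index values, which boolean filtering preserves. If a contiguous 0-based index is preferred, `reset_index(drop=True)` is one option. The sketch below uses a small made-up frame (the `toy`, `tall` and `renumbered` names, the values and the 0.1 threshold are all illustrative):

```python
import pandas as pd

# Toy frame for illustration only; not the abalone data.
toy = pd.DataFrame(
    {"Length": [0.33, 0.53, 0.44], "Height": [0.08, 0.135, 0.125]}
)

tall = toy[toy["Height"] > 0.1]           # keeps original labels 1 and 2
tall = tall.drop(columns=["Length"])      # drop a column by name
renumbered = tall.reset_index(drop=True)  # relabel rows as 0, 1, ...

print(tall.index.tolist())
print(renumbered.index.tolist())
```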


Permalink: https://github.com/basilhan/python/blob/master/PythonDataFrameManipulation.ipynb