This Python notebook loads a CSV file into pandas and walks through some common DataFrame commands. The data comes from the UCI Machine Learning Repository and was saved locally as machine.csv. We begin by importing the pandas and numpy packages.

In [200]:
import numpy as np
import pandas as pd

In [201]:
X=pd.read_csv('machine.csv',header=None)
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209 entries, 0 to 208
Data columns (total 10 columns):
0    209 non-null object
1    209 non-null object
2    209 non-null int64
3    209 non-null int64
4    209 non-null int64
5    209 non-null int64
6    209 non-null int64
7    209 non-null int64
8    209 non-null int64
9    209 non-null int64
dtypes: int64(8), object(2)
memory usage: 16.4+ KB


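Notice the header argument to read_csv. The function takes many parameters; header=None tells pandas that the file has no header row, so the first line is read as data and the columns are labelled with the integers 0 to 9. The info() output also lists the data type pandas inferred for each column: int64 for eight of them and object (strings) for the other two. The columns attribute returns all the column labels.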

In [202]:
X.columns

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
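
To see what header=None changes, the following sketch reads the first two rows of the dataset from an in-memory string instead of machine.csv. The entries in column_names are an assumption taken from the attribute list in the UCI description of the dataset; the notebook itself never names the columns.

```python
import io
import pandas as pd

# First two rows of the dataset, held in memory so no file is needed.
raw = (
    "adviser,32/60,125,256,6000,256,16,128,198,199\n"
    "amdahl,470v/7,29,8000,32000,32,8,32,269,253\n"
)

# header=None: the first line is data, columns are labelled 0..9.
no_header = pd.read_csv(io.StringIO(raw), header=None)

# Assumed column names (not defined anywhere in the notebook).
column_names = ["vendor", "model", "MYCT", "MMIN", "MMAX",
                "CACH", "CHMIN", "CHMAX", "PRP", "ERP"]
named = pd.read_csv(io.StringIO(raw), header=None, names=column_names)

print(list(no_header.columns))
print(list(named.columns))
```

With names supplied, later cells could refer to columns by name rather than by integer label.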

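The command X.head(n) returns the first n rows of the dataset. X.iloc[n] returns the single row at position n, counting from 0. Older pandas versions also offered X.ix[n] for this, but .ix was deprecated and then removed in pandas 1.0, so prefer .iloc for positions and .loc for index labels.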

In [203]:
X.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,adviser,32/60,125,256,6000,256,16,128,198,199
1,amdahl,470v/7,29,8000,32000,32,8,32,269,253
2,amdahl,470v/7a,29,8000,32000,32,8,32,220,253
3,amdahl,470v/7b,29,8000,32000,32,8,32,172,253
4,amdahl,470v/7c,29,8000,16000,32,8,16,132,132


In [204]:
X.iloc[5]

0    amdahl
1    470v/b
2        26
3      8000
4     32000
5        64
6         8
7        32
8       318
9       290
Name: 5, dtype: object
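
Position and label coincide here only because the index is still the default 0 to 208. The sketch below uses a four-row stand-in for X (columns 0 and 8 of the first rows printed by head) to show how .iloc and .loc diverge once rows have been filtered out. The variable names are illustrative.

```python
import pandas as pd

# Four-row stand-in for X: column 0 (brand) and column 8 only.
df = pd.DataFrame({0: ["adviser", "amdahl", "amdahl", "amdahl"],
                   8: [198, 269, 220, 172]})

# Filtering keeps the original index labels, so label 0 disappears.
filtered = df[df[0] != "adviser"]

first_by_position = filtered.iloc[0]   # first remaining row, whatever its label
same_row_by_label = filtered.loc[1]    # the row whose index label is 1

try:
    filtered.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False           # label 0 was filtered out
```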

We can also select only the rows that satisfy a condition on a column. The next cell keeps the rows whose value in column 4 exceeds 40000.

In [205]:
X[X[4]>40000]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
8,amdahl,580-5860,23,16000,64000,64,16,32,636,749
9,amdahl,580-5880,23,32000,64000,128,32,64,1144,1238
198,sperry,1100/93,30,8000,64000,96,12,176,915,919
199,sperry,1100/94,30,8000,64000,128,12,176,1150,978
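
The expression inside the brackets is a boolean Series, one True or False per row, and several such conditions can be combined. A minimal sketch on a five-row stand-in for X (columns 0, 4 and 8 of rows that appear in this notebook); the second and third filters are illustrative and not part of the original analysis.

```python
import pandas as pd

# Five-row stand-in for X, keeping the original index labels.
df = pd.DataFrame(
    {0: ["amdahl", "amdahl", "amdahl", "sperry", "sperry"],
     4: [32000, 64000, 64000, 64000, 64000],
     8: [269, 636, 1144, 915, 1150]},
    index=[1, 8, 9, 198, 199],
)

# The condition on its own is a boolean mask aligned with the rows.
mask = df[4] > 40000
selected = df[mask]

# Combine conditions with & (and) or | (or). Each condition needs its own
# parentheses because & and | bind more tightly than comparisons.
both = df[(df[4] > 40000) & (df[0] == "sperry")]
either = df[(df[8] > 1000) | (df[0] == "sperry")]
```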


For a rudimentary analysis we do not need the 2nd column (label 1), since its values show no pattern we can use. To exclude it, use X.drop('column_name', axis=1) when the columns have names or, as in our case, X.drop(X.columns[[positions]], axis=1). The axis=1 argument says that the labels refer to columns rather than rows. drop returns a new DataFrame, so we assign the result back to X.

In [206]:
X=X.drop(X.columns[[1]],axis=1)
X.head(3)

Unnamed: 0,0,2,3,4,5,6,7,8,9
0,adviser,125,256,6000,256,16,128,198,199
1,amdahl,29,8000,32000,32,8,32,269,253
2,amdahl,29,8000,32000,32,8,32,220,253
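
A small sketch of dropping a column by position and by label, on a two-row stand-in with three columns. The columns= keyword is an equivalent spelling of axis=1 that the notebook does not use; it is shown here only for comparison.

```python
import pandas as pd

# Two-row stand-in with integer column labels 0, 1, 2.
df = pd.DataFrame([["adviser", "32/60", 125],
                   ["amdahl", "470v/7", 29]])

# By position: df.columns[[1]] looks up the label sitting at position 1.
by_position = df.drop(df.columns[[1]], axis=1)

# By label: columns=[1] means the same as passing [1] with axis=1.
by_label = df.drop(columns=[1])

# drop returns a new DataFrame; df itself still has all three columns.
print(list(df.columns), list(by_position.columns))
```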


In [207]:
X[0].value_counts()

ibm             32
nas             19
honeywell       13
ncr             13
sperry          13
siemens         12
cdc              9
amdahl           9
burroughs        8
harris           7
dg               7
hp               7
ipl              6
dec              6
magnuson         6
c.r.d            6
formation        5
prime            5
cambex           5
gould            3
nixdorf          3
perkin-elmer     3
apollo           2
wang             2
basf             2
bti              2
microdata        1
four-phase       1
sratus           1
adviser          1
Name: 0, dtype: int64

As the output shows, value_counts() gives the frequency of each value in a column. We can see straight away that the last 4 brands appear only once each, so all they contribute is a name. We therefore omit three of them (let one survive!) by removing the rows whose value in the 1st column matches those names.

In [208]:
X=X[X[0]!='adviser']
X=X[X[0]!='microdata']
X=X[X[0]!='sratus']
X[0].value_counts()

ibm             32
nas             19
honeywell       13
sperry          13
ncr             13
siemens         12
amdahl           9
cdc              9
burroughs        8
harris           7
hp               7
dg               7
dec              6
ipl              6
magnuson         6
c.r.d            6
formation        5
prime            5
cambex           5
gould            3
perkin-elmer     3
nixdorf          3
basf             2
bti              2
wang             2
apollo           2
four-phase       1
Name: 0, dtype: int64
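
The three separate filters can also be written as one step with isin, and the brands that occur only once can be read off value_counts() instead of being spotted by eye. A minimal sketch on an eight-row stand-in for column 0; the names to_drop and singletons are illustrative.

```python
import pandas as pd

# Eight-row stand-in for the brand column.
vendors = pd.DataFrame({0: ["ibm", "ibm", "nas", "nas",
                            "adviser", "microdata", "sratus", "four-phase"]})

# Brands that occur exactly once.
counts = vendors[0].value_counts()
singletons = counts[counts == 1].index

# Remove several brands at once; ~ negates the boolean mask.
to_drop = ["adviser", "microdata", "sratus"]
kept = vendors[~vendors[0].isin(to_drop)]
```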

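Now that the dataset has been cleared of 'junk' values, another problem remains: the brand, although very important, is stored as a string, which most algorithms cannot use directly. We therefore replace each name with a number. The unique() function returns the distinct values in a column in order of first appearance; numbering them from 1 gives a lookup table, and map() applies that table to the column. The same result can be had one value at a time with X[field].replace(initial, final), but calling it with inplace=True on X[field] is a chained assignment that recent pandas versions warn about and may not write back to X.

In [210]:
Y=X[0].unique()
# number the brands 1, 2, ... in order of first appearance
codes={y:i for i,y in enumerate(Y,start=1)}
# assign the mapped column back instead of modifying it in place
X[0]=X[0].map(codes)
X.describe()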

Unnamed: 0,0,2,3,4,5,6,7,8,9
count,206.0,206.0,206.0,206.0,206.0,206.0,206.0,206.0,206.0
mean,15.597087,204.849515,2896.31068,11880.563107,24.330097,4.640777,17.223301,105.800971,99.451456
std,7.05503,262.015754,3898.714252,11789.26899,37.523947,6.813694,23.81368,161.7451,155.607559
min,1.0,17.0,64.0,64.0,0.0,0.0,0.0,6.0,15.0
25%,10.0,50.0,768.0,4000.0,0.0,1.0,5.0,27.0,28.0
50%,17.0,110.0,2000.0,8000.0,8.0,2.0,8.0,49.5,45.5
75%,20.0,225.0,4000.0,16000.0,32.0,6.0,24.0,112.5,100.5
max,27.0,1500.0,32000.0,64000.0,256.0,52.0,176.0,1150.0,1238.0
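
As an aside, pandas has a built-in helper for this kind of encoding. The sketch below compares the numbering used in the cell above (unique values numbered from 1) with pd.factorize, which numbers from 0 and also returns the lookup of original names. The five-value Series is a stand-in, not the real column.

```python
import pandas as pd

# Stand-in for the brand column.
brands = pd.Series(["amdahl", "amdahl", "apollo", "basf", "amdahl"])

# Same numbering as the cell above: unique values numbered from 1.
lookup = {name: i for i, name in enumerate(brands.unique(), start=1)}
mapped = brands.map(lookup)

# pd.factorize numbers the values from 0, in order of first appearance,
# and returns the unique names so the codes can be translated back.
codes, uniques = pd.factorize(brands)
```

Keeping uniques (or lookup) around makes it possible to recover the brand name from its number later.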


With that, the dataset is ready to be passed to a learning algorithm. We can split the columns into numpy arrays and make predictions using multiple linear regression.
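
A minimal sketch of that last step, using ordinary least squares from numpy. The data here is synthetic (made-up numbers that follow an exact linear rule), and treating column 8 as the target with columns 2, 3 and 4 as features is an assumption for illustration; the notebook does not say which column is to be predicted.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned X: made-up numbers, not machine.csv.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(0.0, 10.0, size=(20, 3)), columns=[2, 3, 4])
true_coef = np.array([5.0, 2.0, -1.0, 0.5])   # intercept, then one weight per feature
df[8] = true_coef[0] + df[[2, 3, 4]].to_numpy() @ true_coef[1:]

# Separate the columns into numpy arrays.
features = df[[2, 3, 4]].to_numpy(dtype=float)
target = df[8].to_numpy(dtype=float)

# Prepend a column of ones so the model can learn an intercept.
design = np.column_stack([np.ones(len(features)), features])

# Least-squares fit of target = design @ coef.
coef, residuals, rank, singular_values = np.linalg.lstsq(design, target, rcond=None)
predictions = design @ coef
```

Because the synthetic target is exactly linear in the features, the fitted coef matches true_coef up to rounding; real data would leave a residual.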