# Read/Save CSV data files using Pandas

We can use the Python package Pandas to: <br>
(1) read and save data files. <br>
(2) visualize data (the visualization functions in Pandas are built on Matplotlib). <br>
(3) combine different data files (dataframes) into one file (dataframe). <br>
<br>
Pandas uses a data type ``DataFrame`` to represent a table; it can be thought of as an enhanced version of a NumPy array (usually 2D).
<br>
A table can be stored in a ``DataFrame``. <br>
Each column of the dataframe has an index (a string, an integer, or another object). <br>
Each row of the dataframe has an index (a string, an integer, or another object). <br>
<br>
Pandas has another data type, ``Series``, which is an enhanced version of a 1D NumPy array. <br>
Each element in a ``Series`` has an index (a string, an integer, or another object).
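
Point (3) above, combining dataframes, is not demonstrated elsewhere in this notebook. The following is a minimal sketch of one way to do it with ``pd.concat``. The two small tables and their values are made up for illustration.

```python
import pandas as pd

# Two small tables with the same columns (made-up values for illustration)
df_a = pd.DataFrame({'Age': [30, 40], 'Sex': ['M', 'F']})
df_b = pd.DataFrame({'Age': [85], 'Sex': ['F']})

# Stack the rows of both tables; ignore_index=True renumbers the rows 0, 1, 2
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined)
```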

In [1]:
import numpy as np
import pandas as pd

## Series Data Object in Pandas

In [2]:
data = pd.Series([0.1, 0.2, 0.3, 0.4])
data

0    0.1
1    0.2
2    0.3
3    0.4
dtype: float64

In [3]:
type(data)

pandas.core.series.Series

We can get an element of a ``Series``  using its index

In [4]:
data[0]

0.1

We can get a sub-series using element indexes in a ``Series``

In [5]:
a = data[0:2] # type(a) is pandas.core.series.Series
a

0    0.1
1    0.2
dtype: float64

We can convert a ``Series`` into a 1D NumPy array using the attribute ``.values`` (it is an attribute, not a method, so there are no parentheses)

In [6]:
a = data.values # type(a) is numpy.ndarray
a

array([0.1, 0.2, 0.3, 0.4])
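
As a small side note, a ``Series`` also has a ``to_numpy()`` method, which should give the same array as ``.values`` for a simple numeric ``Series`` like this one. A minimal sketch (the variable name ``arr`` is chosen here for illustration):

```python
import numpy as np
import pandas as pd

data = pd.Series([0.1, 0.2, 0.3, 0.4])

# to_numpy() is a method (note the parentheses); .values is an attribute
arr = data.to_numpy()
print(type(arr), arr)
```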

### ``Series`` is similar to Python Dictionary and NumPy array

Each element in a ``Series`` has an index (usually a string-index or an integer-index). <br>
We can access an element using an integer-index: similar to a NumPy array. <br>
We can access an element using a string-index: similar to a Python dictionary.

In [7]:
data = pd.Series([0.1, 0.2, 0.3, 0.4], index=['a', 'b', 'c', 'd'])
data

a    0.1
b    0.2
c    0.3
d    0.4
dtype: float64

A ``Series`` has an attribute ``index``, which is an array-like object

In [8]:
# get the string-index of each element
[data.index[0], data.index[1], data.index[2], data.index[3]]

['a', 'b', 'c', 'd']

Get the element using the string-index

In [9]:
data['b']

0.2

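Get the element using its integer position. When the index labels are strings, recent Pandas versions recommend ``data.iloc[1]`` for positional access; the plain ``data[1]`` form below may emit a warning or fail, depending on the installed version.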

In [10]:
data[1]

0.2

We can use non-contiguous indexes in a ``Series``

In [11]:
data = pd.Series([0.1, 0.2, 0.3, 0.4], index=[-1, 100, 2, 3])
data

-1      0.1
 100    0.2
 2      0.3
 3      0.4
dtype: float64

In [12]:
data[-1] # it is not the last element

0.1

In [13]:
data[-1:101] # an integer slice is positional (NumPy-style), not label-based, so this returns only the last element; do not use this notation to get a sub-series

3    0.4
dtype: float64
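
To avoid the ambiguity shown above, one option is to state the intent explicitly: ``.loc`` selects by index label and ``.iloc`` selects by integer position. A minimal sketch using the same ``Series`` (the variable names are chosen here for illustration):

```python
import pandas as pd

data = pd.Series([0.1, 0.2, 0.3, 0.4], index=[-1, 100, 2, 3])

by_label = data.loc[-1]         # element whose index label is -1 (the first element)
by_position = data.iloc[-1]     # element at the last position
label_slice = data.loc[-1:100]  # label-based slice; both end labels are included

print(by_label, by_position)
print(label_slice)
```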

In [14]:
data1 = pd.Series([0.1, 0.2, 0.3, 0.4]) # we do not specify index here
# it is the same as
data2 = pd.Series([0.1, 0.2, 0.3, 0.4], index = [0, 1, 2, 3]) # indexes are contiguous from 0

In [15]:
data1

0    0.1
1    0.2
2    0.3
3    0.4
dtype: float64

In [16]:
data2

0    0.1
1    0.2
2    0.3
3    0.4
dtype: float64

### Create a Series from a Python Dictionary

In [17]:
patient_info = {'Age': 20,
                'Blood_Type': 'O',
                'sex': 'M',
                'Address': 'Base0, Mars',
                'Phone': '001001001',
                'Diagnosis': 'bone fracture in foot'}
#
patient_info = pd.Series(patient_info)
print(patient_info)
print('type(patient_info) is', type(patient_info))

Age                              20
Blood_Type                        O
sex                               M
Address                 Base0, Mars
Phone                     001001001
Diagnosis     bone fracture in foot
dtype: object
type(patient_info) is <class 'pandas.core.series.Series'>


In [18]:
patient_info[4]

'001001001'

In [19]:
patient_info['Phone']

'001001001'

In [20]:
patient_info[0:4] # patient_info[4]/['Phone'] is not included

Age                    20
Blood_Type              O
sex                     M
Address       Base0, Mars
dtype: object

``Series`` supports slicing using strings as the start index and the end index

In [21]:
patient_info['Age':'Phone']
# ['Phone'] is included: this is inconsistent with the above integer-index notation

Age                    20
Blood_Type              O
sex                     M
Address       Base0, Mars
Phone           001001001
dtype: object

## Dataframe Object in Pandas 
A DataFrame is usually used to represent a table. <br>
The values of a table form a matrix (2D NumPy array). <br>
Each row of the table has an index (usually a string-index or an integer-index). <br>
Each column of the table has an index (usually a string-index or an integer-index). <br>

In [22]:
Matrix = [[1, 2],
          [3, 4],
          [5, 6]]

In [23]:
df = pd.DataFrame(Matrix, columns=['ColumnA', 'ColumnB'], index=['RowA', 'RowB', 'RowC']) 
print('type(df)', type(df))
df

type(df) <class 'pandas.core.frame.DataFrame'>


      ColumnA  ColumnB
RowA        1        2
RowB        3        4
RowC        5        6


In [24]:
type(df)

pandas.core.frame.DataFrame

In [25]:
df.columns

Index(['ColumnA', 'ColumnB'], dtype='object')

In [26]:
df.index

Index(['RowA', 'RowB', 'RowC'], dtype='object')

In [27]:
# get the first column using its identifier/name
df['ColumnA']

RowA    1
RowB    3
RowC    5
Name: ColumnA, dtype: int64

In [28]:
# try to get the first row by its identifier/name
df['RowA'] # this is wrong: df[...] looks up a column name, not a row name

KeyError: 'RowA'
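
Rows can be selected by name with ``df.loc``, which uses index labels rather than integer positions. A minimal sketch on the same table (the variable names ``row_a`` and ``element`` are chosen here for illustration):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
                  columns=['ColumnA', 'ColumnB'],
                  index=['RowA', 'RowB', 'RowC'])

row_a = df.loc['RowA']               # the first row, returned as a Series
element = df.loc['RowA', 'ColumnB']  # one element, by row label and column label

print(row_a)
print(element)
```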

Get the first row using ``df.iloc`` with an integer-index

In [29]:
df.iloc[0,:]

ColumnA    1
ColumnB    2
Name: RowA, dtype: int64

In [30]:
type(df.iloc[0,:])

pandas.core.series.Series

Get the first column using ``df.iloc`` with an integer-index

In [31]:
df.iloc[:,0]

RowA    1
RowB    3
RowC    5
Name: ColumnA, dtype: int64

In [32]:
type(df.iloc[:,0])

pandas.core.series.Series

Get an element in the DataFrame using ``df.iloc`` with integer-indexes

In [33]:
df.iloc[0,1]

2

In [34]:
df

      ColumnA  ColumnB
RowA        1        2
RowB        3        4
RowC        5        6


## Convert a DataFrame to a NumPy Array using ``DataFrame.values``

In [35]:
A = df.values
A

array([[1, 2],
       [3, 4],
       [5, 6]])

In [36]:
type(A)

numpy.ndarray

# Load data from a CSV file

A CSV file contains comma-separated values (CSV). <br>
https://en.wikipedia.org/wiki/Comma-separated_values

In [38]:
df = pd.read_csv('patient_record.csv', sep=',') # in the file, the values are separated by commas
df

   Age Sex  Tumor_size_mm
0   30   M            1.0
1   40   F            2.0
2   85   F            0.1
3   75   M            1.0
4   95   F            3.0
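
If ``patient_record.csv`` is not at hand, the same call can be tried on an in-memory text buffer. The sketch below assumes the file contents match the table displayed above; the real file may differ (for example, in whitespace or extra columns). The names ``csv_text`` and ``df_demo`` are chosen here for illustration.

```python
import io
import pandas as pd

# Assumed file contents, reconstructed from the table displayed above
csv_text = """Age,Sex,Tumor_size_mm
30,M,1.0
40,F,2.0
85,F,0.1
75,M,1.0
95,F,3.0
"""

# read_csv accepts a file-like object as well as a file path
df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo)
print(df_demo.dtypes)
```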


In [39]:
df.columns

Index(['Age', 'Sex', 'Tumor_size_mm'], dtype='object')

In [None]:
df.index

In [None]:
# convert the DataFrame to a NumPy array
data = df.values
data

We convert M to 0 and convert F to 1 to get a numeric array

In [None]:
data[np.where(data=='M')]=0
data[np.where(data=='F')]=1
data
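
The same encoding can also be done on the DataFrame itself before converting to NumPy. This is an alternative sketch, not the approach used in this notebook; it rebuilds the table by hand (values copied from the output above) so that it runs without the CSV file, and the names ``df_demo``, ``df_num`` and ``data_demo`` are chosen here for illustration.

```python
import pandas as pd

# Table rebuilt by hand from the displayed output, so no file is needed
df_demo = pd.DataFrame({'Age': [30, 40, 85, 75, 95],
                        'Sex': ['M', 'F', 'F', 'M', 'F'],
                        'Tumor_size_mm': [1.0, 2.0, 0.1, 1.0, 3.0]})

df_num = df_demo.copy()
# Replace each label in the Sex column using a dictionary: M -> 0, F -> 1
df_num['Sex'] = df_num['Sex'].map({'M': 0, 'F': 1})

# All columns are numeric now, so the array can be float64 directly
data_demo = df_num.to_numpy(dtype='float64')
print(data_demo)
```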

### Change the data type from 'object' to 'float64'

In [None]:
data=data.astype('float64')
data

# Process the data
Assume that, after brain surgeries, the tumors of the male patients have been removed. <br>

In [None]:
data_new = data.copy()
# assume we can use one line of code to remove the tumors of the male patients:
# column 1 is the sex code (M=0, F=1), so multiplying by it sets the male tumor sizes to 0
data_new[:,2] = data_new[:,1]*data_new[:,2]
data_new
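
For comparison, the same step can be written on a DataFrame with a boolean mask, which avoids encoding the sex as a number first. This is an alternative sketch under the assumption that the table matches the output shown earlier; the names ``df_demo`` and ``df_proc`` are chosen here for illustration.

```python
import pandas as pd

# Table rebuilt by hand from the displayed output, so no file is needed
df_demo = pd.DataFrame({'Age': [30, 40, 85, 75, 95],
                        'Sex': ['M', 'F', 'F', 'M', 'F'],
                        'Tumor_size_mm': [1.0, 2.0, 0.1, 1.0, 3.0]})

df_proc = df_demo.copy()
# Select the rows where Sex is 'M' and set their tumor size to 0
df_proc.loc[df_proc['Sex'] == 'M', 'Tumor_size_mm'] = 0.0
print(df_proc)
```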

In [None]:
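data_new = data_new.astype('object') # change the data type from float64 to object (to store str objects)
# recover the labels from the sex code itself (M=0, F=1), not from the tumor size,
# so that a female patient with tumor size 0 would not be mislabeled as 'M'
is_male = data_new[:,1] == 0 # compute the mask before overwriting the column
data_new[is_male, 1] = 'M'
data_new[~is_male, 1] = 'F'
data_new[:,0] = data_new[:,0].astype('int64') # store the ages as integers again
data_new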

# Save the data to a csv file

In [None]:
# create a new DataFrame using data_new and the original column/row indexes
df_new = pd.DataFrame(data_new, columns=df.columns, index=df.index)
df_new

In [None]:
# save the new DataFrame df_new to a CSV file
# set index=False, so the row indexes will not be saved
df_new.to_csv('patient_record_new.csv', index=False, sep=',')
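
One way to check the effect of ``index=False`` is to write the table and read it back. The sketch below uses an in-memory buffer instead of a file on disk, and a hand-built table whose values are assumed from the steps above; the names ``df_out``, ``buffer`` and ``df_back`` are chosen here for illustration.

```python
import io
import pandas as pd

# Hand-built table standing in for df_new (values assumed from the steps above)
df_out = pd.DataFrame({'Age': [30, 40, 85, 75, 95],
                       'Sex': ['M', 'F', 'F', 'M', 'F'],
                       'Tumor_size_mm': [0.0, 2.0, 0.1, 0.0, 3.0]})

buffer = io.StringIO()
df_out.to_csv(buffer, index=False)  # row indexes are not written
buffer.seek(0)                      # rewind the buffer before reading

df_back = pd.read_csv(buffer)
print(df_back)
```

Without ``index=False``, the saved text would contain an extra unnamed first column holding the row indexes, and reading it back with the default settings would produce a column named ``Unnamed: 0``.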