### DataFrame

Inevitably, we want to be able to store, view and manipulate data that is *multivariate*, where for every index there are multiple fields or columns of data, often of varying data type.

A `DataFrame` in Pandas is a tabular data structure that represents data in a two-dimensional format, similar to a spreadsheet or a SQL table. It is a key data structure provided by the Pandas library for data manipulation and analysis. While a `DataFrame` is inherently two-dimensional, it can represent and manipulate higher-dimensional data through techniques like hierarchical indexing (`MultiIndex`) and stacking/unstacking.
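As a minimal sketch of the hierarchical-indexing idea, the following builds a small two-level index and uses `unstack()` to pivot the inner level into columns. The labels and counts are illustrative values chosen for this example only.

```python
import pandas as pd

# Illustrative data: one count per (patient, phylum) pair
index = pd.MultiIndex.from_product(
    [[1, 2], ["Firmicutes", "Proteobacteria"]],
    names=["patient", "phylum"],
)
counts = pd.Series([632, 1638, 433, 1130], index=index, name="value")

# A single value is addressed by a tuple of labels, one per level
one_value = counts.loc[(2, "Firmicutes")]

# unstack() moves the inner index level (phylum) into the columns,
# turning the long Series into a wide two-dimensional table
wide = counts.unstack("phylum")
print(wide)
```

Each extra index level acts like an additional dimension, which is how a two-dimensional structure can hold data that is conceptually three-dimensional or higher.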

In [1]:
import pandas as pd
import numpy as np

# Set some Pandas options
pd.options.display.max_columns = 30
#pd.options.display.max_rows = 20

In [2]:
# Create an empty DataFrame
empty_df = pd.DataFrame()

# Add data to the empty DataFrame
empty_df['value'] = [632, 1638, 569, 115, 433, 1130, 754, 555]
empty_df['patient'] = [1, 1, 1, 1, 2, 2, 2, 2]
empty_df['phylum'] = ['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes', 'Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes']

empty_df

Unnamed: 0,value,patient,phylum
0,632,1,Firmicutes
1,1638,1,Proteobacteria
2,569,1,Actinobacteria
3,115,1,Bacteroidetes
4,433,2,Firmicutes
5,1130,2,Proteobacteria
6,754,2,Actinobacteria
7,555,2,Bacteroidetes


In [3]:
data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555],
                     'patient':[1, 1, 1, 1, 2, 2, 2, 2], #[1]*4+[2]*4
                     'phylum':['Firmicutes', 'Proteobacteria', 'Actinobacteria', 
    'Bacteroidetes', 'Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes']})
data

Unnamed: 0,value,patient,phylum
0,632,1,Firmicutes
1,1638,1,Proteobacteria
2,569,1,Actinobacteria
3,115,1,Bacteroidetes
4,433,2,Firmicutes
5,1130,2,Proteobacteria
6,754,2,Actinobacteria
7,555,2,Bacteroidetes


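Notice that the columns appear in the order in which they were supplied (older versions of Pandas sorted them by column name). We can sort the columns by name with `sort_index(axis=1)`, or put them in any order we like by indexing them in that order: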

In [4]:
data.sort_index(axis=1) #ascending = True

Unnamed: 0,patient,phylum,value
0,1,Firmicutes,632
1,1,Proteobacteria,1638
2,1,Actinobacteria,569
3,1,Bacteroidetes,115
4,2,Firmicutes,433
5,2,Proteobacteria,1130
6,2,Actinobacteria,754
7,2,Bacteroidetes,555


In [5]:
data

Unnamed: 0,value,patient,phylum
0,632,1,Firmicutes
1,1638,1,Proteobacteria
2,569,1,Actinobacteria
3,115,1,Bacteroidetes
4,433,2,Firmicutes
5,1130,2,Proteobacteria
6,754,2,Actinobacteria
7,555,2,Bacteroidetes


In [6]:
d1 = data[["phylum", "patient"]]
d1

Unnamed: 0,phylum,patient
0,Firmicutes,1
1,Proteobacteria,1
2,Actinobacteria,1
3,Bacteroidetes,1
4,Firmicutes,2
5,Proteobacteria,2
6,Actinobacteria,2
7,Bacteroidetes,2


In [7]:
data[['phylum','value','patient']]

Unnamed: 0,phylum,value,patient
0,Firmicutes,632,1
1,Proteobacteria,1638,1
2,Actinobacteria,569,1
3,Bacteroidetes,115,1
4,Firmicutes,433,2
5,Proteobacteria,1130,2
6,Actinobacteria,754,2
7,Bacteroidetes,555,2


In [8]:
data["patient"].dtype

dtype('int64')

In [9]:
# 'O' is the "object" dtype in Pandas, which can hold any Python object,
# including strings, lists, dictionaries, and more.

data["phylum"].dtype

dtype('O')

In [10]:
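# Caution: every Python object is an instance of `object`, so this test prints "Yes" for any column.
# To test for the object dtype, compare the dtype itself: data["phylum"].dtype == object
if isinstance(data["phylum"].dtype, object):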
    print("Yes")
else:
    print("No")

Yes


In [12]:
#isinstance(data["patient"].dtype, int) checks if the dtype object itself is an instance of int, 
#which is not true because data["patient"].dtype is a NumPy dtype object.
#data["patient"].dtype returns a dtype object like int64, not the standard Python int.

if isinstance(data["patient"].dtype, int):
    print("Yes")
else:
    print("No")

No


In [13]:
if np.issubdtype(data["patient"].dtype, np.integer): #int
    print("Yes")
else:
    print("No")

Yes


In [14]:
# Check if the 'patient' column is of integer dtype
if pd.api.types.is_integer_dtype(data['patient']):
    print("Yes")
else:
    print("No")

Yes


In [15]:
# Check if the 'patient' column is of dtype int64
if data['patient'].dtype == np.int64:
    print("Yes")
else:
    print("No")

Yes


In [16]:
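# "value" is an integer column, yet this still prints "Yes": isinstance(..., object) is always True
if isinstance(data["value"].dtype, object):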
    print("Yes")
else:
    print("No")

Yes


In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   value    8 non-null      int64 
 1   patient  8 non-null      int64 
 2   phylum   8 non-null      object
dtypes: int64(2), object(1)
memory usage: 324.0+ bytes


In [17]:
data.shape

(8, 3)

In [18]:
data.columns.tolist()

['value', 'patient', 'phylum']

In [19]:
cols = data.columns.tolist()
print("cols:", cols)
obj = []
for col in cols:
    # Compare with "O" instead of int to collect the object columns;
    # isinstance(data[col].dtype, object) would match every column.
    if data[col].dtype == int:
        obj.append(col) 
obj

cols: ['value', 'patient', 'phylum']


['value', 'patient']
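The loop above can also be written with `select_dtypes`, which filters columns by dtype family in one call. This is a sketch on a small stand-in frame (the `df` name and its three rows are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "value": [632, 1638, 569],
    "patient": [1, 1, 2],
    "phylum": ["Firmicutes", "Proteobacteria", "Actinobacteria"],
})

# "number" matches every numeric dtype (int64, float64, ...)
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# exclude= returns everything else, here the text column
non_numeric_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```

Filtering on the `"number"` family avoids depending on the exact integer width, which can differ between platforms.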

#### NaN (Not a Number) vs None
NaN:

- Float type.
- Used specifically for numerical calculations.
- Operations result in NaN.
- Not equal to itself.

None:

- NoneType.
- General-purpose null value in Python.
- Operations can result in errors unless handled explicitly.
- Equal to itself.
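Because `NaN` is not equal to itself, an equality test cannot detect it. A minimal sketch of the usual alternative, `pd.isna()`, which treats both `NaN` and `None` as missing (the sample values are illustrative):

```python
import numpy as np
import pandas as pd

values = [1.5, np.nan, None, "text"]

# pd.isna() returns True for NaN and for None, False otherwise
flags = [bool(pd.isna(v)) for v in values]
print(flags)

# The same check works element-wise on a Series
s = pd.Series([1, None, 3])
missing = s.isna().tolist()
print(missing)
```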

In [19]:
np.nan*5

nan

In [20]:
None*5

TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

In [21]:
print(np.nan != np.nan)
print(np.nan == np.nan)
print(None != None)
print(None == None)

True
False
False
True


In [22]:
s = pd.Series([1, 2, np.nan, 4])

# Arithmetic operation
result = s + 2
print(result)

0    3.0
1    4.0
2    NaN
3    6.0
dtype: float64


In [23]:
s = pd.Series([1, 2, None, 4])
s

0    1.0
1    2.0
2    NaN
3    4.0
dtype: float64

In [24]:
# Arithmetic operation
result = s + 2
print(result)

0    3.0
1    4.0
2    NaN
3    6.0
dtype: float64


In [25]:
s = pd.Series([1, 2, None, 4, 4+3j])
print(s)

0    1.0+0.0j
1    2.0+0.0j
2   N000a000N
3    4.0+0.0j
4    4.0+3.0j
dtype: complex128


In [26]:
df = pd.DataFrame({
    'Name': ['Shakir', 'Maharram', None, 'Orxan']
})
print(df)

try:
    df['Name_upper'] = df['Name'].str.upper()
except AttributeError as e:
    print("Error:", e)

       Name
0    Shakir
1  Maharram
2      None
3     Orxan


In [27]:
df

Unnamed: 0,Name,Name_upper
0,Shakir,SHAKIR
1,Maharram,MAHARRAM
2,,
3,Orxan,ORXAN


In [28]:
df = pd.DataFrame({
    'Name': ['Shakir', 'Maharram',np.nan, 'Orxan']
})
print(df)

try:
    df['Name_upper'] = df['Name'].str.upper()
except AttributeError as e:
    print("Error:", e)

       Name
0    Shakir
1  Maharram
2       NaN
3     Orxan


##### The `apply()` function is used to apply a function along either axis (rows or columns) of a DataFrame or a Series. 

In [30]:
df = pd.DataFrame({
    'Name': ['Ahmed', 'Tofiq', None, 'Davud']
})

print("Original DataFrame:")
print(df)

# Custom function that does not handle None values
def length(x):
    return len(x)

# Apply the function to the 'Name' column
try:
    df['Name_length'] = df['Name'].apply(length)
except TypeError as e:
    print("Error:", e)

Original DataFrame:
    Name
0  Ahmed
1  Tofiq
2   None
3  Davud
Error: object of type 'NoneType' has no len()


In [31]:
def length_safe(x):
    if x is None:
        return 0
    return len(x)

# Apply the function to the 'Name' column
df['Name_length'] = df['Name'].apply(length_safe)
print("DataFrame after applying safe function:")
print(df)

DataFrame after applying safe function:
    Name  Name_length
0  Ahmed            5
1  Tofiq            5
2   None            0
3  Davud            5
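For string lengths specifically, the vectorized `.str` accessor is an alternative to a hand-written safe function, since it passes missing entries through instead of raising. A sketch, assuming the same four names as above:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ahmed", "Tofiq", None, "Davud"]})

# .str.len() leaves a missing value where the name is missing
lengths = df["Name"].str.len()
print(lengths)

# Replace the missing length with 0 and convert to plain integers
df["Name_length"] = lengths.fillna(0).astype(int)
print(df)
```

Whether a missing name should count as length 0 or stay missing is a modelling decision; the `fillna(0)` step here only mirrors the behaviour of `length_safe`.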


In [32]:
data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555, None,687],
                     'patient':[1, 1, 1, 1, 2, 2, 2, None, 2,2],
                     'phylum':['Firmicutes', 'Proteobacteria', 'Actinobacteria', 
    'Bacteroidetes', 'Firmicutes', 'Proteobacteria',None, 'Actinobacteria', 'Bacteroidetes',None]})
data

Unnamed: 0,value,patient,phylum
0,632.0,1.0,Firmicutes
1,1638.0,1.0,Proteobacteria
2,569.0,1.0,Actinobacteria
3,115.0,1.0,Bacteroidetes
4,433.0,2.0,Firmicutes
5,1130.0,2.0,Proteobacteria
6,754.0,2.0,
7,555.0,,Actinobacteria
8,,2.0,Bacteroidetes
9,687.0,2.0,


In [33]:
data.replace({None: np.nan})

Unnamed: 0,value,patient,phylum
0,632.0,1.0,Firmicutes
1,1638.0,1.0,Proteobacteria
2,569.0,1.0,Actinobacteria
3,115.0,1.0,Bacteroidetes
4,433.0,2.0,Firmicutes
5,1130.0,2.0,Proteobacteria
6,754.0,2.0,
7,555.0,,Actinobacteria
8,,2.0,Bacteroidetes
9,687.0,2.0,


In [34]:
data

Unnamed: 0,value,patient,phylum
0,632.0,1.0,Firmicutes
1,1638.0,1.0,Proteobacteria
2,569.0,1.0,Actinobacteria
3,115.0,1.0,Bacteroidetes
4,433.0,2.0,Firmicutes
5,1130.0,2.0,Proteobacteria
6,754.0,2.0,
7,555.0,,Actinobacteria
8,,2.0,Bacteroidetes
9,687.0,2.0,


In [35]:
data = data.replace({None: np.nan})
data
# Equivalent to data.replace({None: np.nan}, inplace=True)

Unnamed: 0,value,patient,phylum
0,632.0,1.0,Firmicutes
1,1638.0,1.0,Proteobacteria
2,569.0,1.0,Actinobacteria
3,115.0,1.0,Bacteroidetes
4,433.0,2.0,Firmicutes
5,1130.0,2.0,Proteobacteria
6,754.0,2.0,
7,555.0,,Actinobacteria
8,,2.0,Bacteroidetes
9,687.0,2.0,


In [36]:
data.replace({None: np.nan}, inplace = True)
data

Unnamed: 0,value,patient,phylum
0,632.0,1.0,Firmicutes
1,1638.0,1.0,Proteobacteria
2,569.0,1.0,Actinobacteria
3,115.0,1.0,Bacteroidetes
4,433.0,2.0,Firmicutes
5,1130.0,2.0,Proteobacteria
6,754.0,2.0,
7,555.0,,Actinobacteria
8,,2.0,Bacteroidetes
9,687.0,2.0,


In [37]:
data.isnull().sum()

value      1
patient    1
phylum     2
dtype: int64

In [38]:
data.value_counts()

value   patient  phylum        
115.0   1.0      Bacteroidetes     1
433.0   2.0      Firmicutes        1
569.0   1.0      Actinobacteria    1
632.0   1.0      Firmicutes        1
1130.0  2.0      Proteobacteria    1
1638.0  1.0      Proteobacteria    1
Name: count, dtype: int64

In [39]:
data['phylum'].value_counts()

phylum
Firmicutes        2
Proteobacteria    2
Actinobacteria    2
Bacteroidetes     2
Name: count, dtype: int64

##### The `map()` function is used to map values in a Series to a new set of values, based on a dictionary, a function, or a Series.

In [40]:
#Map works on Series only (not DataFrame).
mp = {"Actinobacteria":0, "Bacteroidetes":1,"Firmicutes":2, "Proteobacteria":3 }
data["phylum"]= data["phylum"].map(mp)
data["phylum"]

0    2.0
1    3.0
2    0.0
3    1.0
4    2.0
5    3.0
6    NaN
7    0.0
8    1.0
9    NaN
Name: phylum, dtype: float64

In [39]:
data

Unnamed: 0,value,patient,phylum
0,632.0,1.0,2.0
1,1638.0,1.0,3.0
2,569.0,1.0,0.0
3,115.0,1.0,1.0
4,433.0,2.0,2.0
5,1130.0,2.0,3.0
6,754.0,2.0,
7,555.0,,0.0
8,,2.0,1.0
9,687.0,2.0,


In [41]:
#If there is no mapping, the value is replaced by NaN.
data1 = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555, None,687],
                     'patient':[1, 1, 1, 1, 2, 2, 2, None, 2,2], #[1]*4+[2]*4
                     'phylum':['Firmicutes', 'Proteobacteria', 'Actinobacteria', 
    'Bacteroidetes', 'Firmicutes', 'Proteobacteria',None, 'Actinobacteria', 'Bacteroidetes',None]})

mp1 = {"Bacteroidetes":1,"Firmicutes":2, "Proteobacteria":3 }
data1["phylum"] = data1["phylum"].map(mp1)
data1["phylum"]

0    2.0
1    3.0
2    NaN
3    1.0
4    2.0
5    3.0
6    NaN
7    NaN
8    1.0
9    NaN
Name: phylum, dtype: float64
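Since unmapped labels and missing values both come back as `NaN`, it can help to mark them explicitly. A minimal sketch, where the sentinel `-1` is an arbitrary choice made for this example:

```python
import pandas as pd

phylum = pd.Series(["Firmicutes", "Actinobacteria", None, "Bacteroidetes"])
codes = {"Bacteroidetes": 1, "Firmicutes": 2, "Proteobacteria": 3}

# "Actinobacteria" has no entry in the dict and None is missing:
# both become NaN, which forces a float64 result
mapped = phylum.map(codes)

# Fill the gaps with a sentinel and restore an integer dtype
encoded = mapped.fillna(-1).astype(int)
print(encoded.tolist())
```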

In [42]:
data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555],
                     'patient':[1]*4+[2]*4,
                     'phylum':['Firmicutes', 'Proteobacteria', 'Actinobacteria', 
    'Bacteroidetes', 'Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes']})
data

Unnamed: 0,value,patient,phylum
0,632,1,Firmicutes
1,1638,1,Proteobacteria
2,569,1,Actinobacteria
3,115,1,Bacteroidetes
4,433,2,Firmicutes
5,1130,2,Proteobacteria
6,754,2,Actinobacteria
7,555,2,Bacteroidetes


In [43]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
cols = data.columns.tolist()
for col in cols:
    if data[col].dtype =='O':
        l = le.fit_transform(data[col])
        data[col]=l
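The same integer encoding can be sketched without scikit-learn by using the Pandas `category` dtype. Categories built from strings are sorted alphabetically by default, which gives the same codes as `LabelEncoder` for this data. The `phylum_code` column name and `lookup` dict are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "phylum": ["Firmicutes", "Proteobacteria", "Actinobacteria",
               "Bacteroidetes", "Firmicutes"],
})

# Convert to categorical, then read off the integer code of each row
cat = df["phylum"].astype("category")
df["phylum_code"] = cat.cat.codes

# Keep a code -> label table so the encoding can be reversed later
lookup = dict(enumerate(cat.cat.categories))
print(df)
print(lookup)
```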

In [44]:
data

Unnamed: 0,value,patient,phylum
0,632,1,2
1,1638,1,3
2,569,1,0
3,115,1,1
4,433,2,2
5,1130,2,3
6,754,2,0
7,555,2,1


In [45]:
data.tail()

Unnamed: 0,value,patient,phylum
3,115,1,1
4,433,2,2
5,1130,2,3
6,754,2,0
7,555,2,1


In [46]:
data.head()

Unnamed: 0,value,patient,phylum
0,632,1,2
1,1638,1,3
2,569,1,0
3,115,1,1
4,433,2,2


In [47]:
data.head(10)

Unnamed: 0,value,patient,phylum
0,632,1,2
1,1638,1,3
2,569,1,0
3,115,1,1
4,433,2,2
5,1130,2,3
6,754,2,0
7,555,2,1


In [48]:
data.select_dtypes(include='int')  # for object columns use include='O' (capital letter O, not zero)

Unnamed: 0,value,patient,phylum
0,632,1,2
1,1638,1,3
2,569,1,0
3,115,1,1
4,433,2,2
5,1130,2,3
6,754,2,0
7,555,2,1


In [50]:
data.isnull().sum()

value      0
patient    0
phylum     0
dtype: int64

In [51]:
data.isna().sum()  # isna() and isnull() are aliases

value      0
patient    0
phylum     0
dtype: int64

In [52]:
data.apply(lambda x:x.isnull().sum())

value      0
patient    0
phylum     0
dtype: int64

In [53]:
data['value'].isnull().sum()

0

If we wish to access columns, we can do so either by dict-like indexing (using `[]`) or by attribute (using `.`):

In [54]:
data['value'].values.tolist()

[632, 1638, 569, 115, 433, 1130, 754, 555]

In [55]:
data.patient.values.tolist()

[1, 1, 1, 1, 2, 2, 2, 2]

In [56]:
data['value'].unique()

array([ 632, 1638,  569,  115,  433, 1130,  754,  555])

In [57]:
data['value'].value_counts()

value
632     1
1638    1
569     1
115     1
433     1
1130    1
754     1
555     1
Name: count, dtype: int64

In [58]:
data.columns

Index(['value', 'patient', 'phylum'], dtype='object')

In [64]:
data.rename(columns = {"value":"value new"}, inplace = True)

In [59]:
data.rename(columns = {"patient":"patient new"}, inplace = True)

In [60]:
data1 = data.rename(columns = {"phylum":"phylum new"}, inplace = False)

In [61]:
data1

Unnamed: 0,value,patient new,phylum new
0,632,1,2
1,1638,1,3
2,569,1,0
3,115,1,1
4,433,2,2
5,1130,2,3
6,754,2,0
7,555,2,1


In [62]:
data

Unnamed: 0,value,patient new,phylum
0,632,1,2
1,1638,1,3
2,569,1,0
3,115,1,1
4,433,2,2
5,1130,2,3
6,754,2,0
7,555,2,1


In [65]:
data["value new"]

0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value new, dtype: int64

In [66]:
data[['value new']]

Unnamed: 0,value new
0,632
1,1638
2,569
3,115
4,433
5,1130
6,754
7,555


In [67]:
type(data[['value new']])

pandas.core.frame.DataFrame

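Notice this is different than with `Series`, where dict-like indexing retrieved a particular element (row). If we want access to a row in a `DataFrame`, we index its `loc` attribute (by label) or its `iloc` attribute (by integer position). The older `ix` indexer has been removed from Pandas.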
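The difference between label-based and position-based row access only becomes visible when the index labels are not simply 0, 1, 2, and so on. A small sketch with deliberately mismatched labels (the frame is illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [632, 1638, 569]},
    index=[10, 20, 30],  # labels deliberately differ from positions
)

by_label = df.loc[20, "value"]   # row whose index label is 20
by_position = df.iloc[1, 0]      # second row, first column

print(by_label, by_position)

# Selecting a whole row returns a Series indexed by column name
row = df.loc[10]
```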


In [68]:
data

Unnamed: 0,value new,patient new,phylum
0,632,1,2
1,1638,1,3
2,569,1,0
3,115,1,1
4,433,2,2
5,1130,2,3
6,754,2,0
7,555,2,1


In [69]:
data.iloc[3, 0]

115

In [70]:
data.loc[3, 'value new']

115

Alternatively, we can create a `DataFrame` with a dict of dicts:

In [71]:
data = pd.DataFrame({0: {'patient': 1, 'phylum':'Firmicutes', 'value': 632},
                    1: {'patient': 1, 'phylum': 'Proteobacteria', 'value': 1638},
                    2: {'patient': 1, 'phylum': 'Actinobacteria', 'value': 569},
                    3: {'patient': 1, 'phylum': 'Bacteroidetes', 'value': 115},
                    4: {'patient': 2, 'phylum': 'Firmicutes', 'value': 433},
                    5: {'patient': 2, 'phylum': 'Proteobacteria', 'value': 1130},
                    6: {'patient': 2, 'phylum': 'Actinobacteria', 'value': 754},
                    7: {'patient': 2, 'phylum': 'Bacteroidetes', 'value': 555}})

In [72]:
data

Unnamed: 0,0,1,2,3,4,5,6,7
patient,1,1,1,1,2,2,2,2
phylum,Firmicutes,Proteobacteria,Actinobacteria,Bacteroidetes,Firmicutes,Proteobacteria,Actinobacteria,Bacteroidetes
value,632,1638,569,115,433,1130,754,555


We probably want this transposed:

In [73]:
data = data.T
data

Unnamed: 0,patient,phylum,value
0,1,Firmicutes,632
1,1,Proteobacteria,1638
2,1,Actinobacteria,569
3,1,Bacteroidetes,115
4,2,Firmicutes,433
5,2,Proteobacteria,1130
6,2,Actinobacteria,754
7,2,Bacteroidetes,555
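Transposing a frame with mixed types leaves every column with dtype `object`. One way to avoid that, sketched here on two of the records, is `DataFrame.from_dict` with `orient="index"`, which treats the outer keys as row labels directly:

```python
import pandas as pd

records = {
    0: {"patient": 1, "phylum": "Firmicutes", "value": 632},
    1: {"patient": 1, "phylum": "Proteobacteria", "value": 1638},
}

# Columns 0 and 1 mix integers and strings, so after .T
# every column is stored as generic Python objects
transposed = pd.DataFrame(records).T
print(transposed.dtypes)

# With orient="index" each column gets its own inferred dtype
direct = pd.DataFrame.from_dict(records, orient="index")
print(direct.dtypes)
```

Keeping numeric columns numeric matters later on, because comparisons and arithmetic on `object` columns are slower and can behave unexpectedly.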


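It's important to note that the Series returned when a DataFrame is indexed may be a **view** on the DataFrame rather than a copy of the data, so a change made through the Series can show up in the DataFrame. (When Copy-on-Write is enabled with `pd.options.mode.copy_on_write = True`, the Series behaves as a copy instead.) So you must be cautious when manipulating this data: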

In [74]:
vals = data['value']
vals

0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value, dtype: object

In [75]:
vals[5] = 0
vals

0     632
1    1638
2     569
3     115
4     433
5       0
6     754
7     555
Name: value, dtype: object

In [76]:
data

Unnamed: 0,patient,phylum,value
0,1,Firmicutes,632
1,1,Proteobacteria,1638
2,1,Actinobacteria,569
3,1,Bacteroidetes,115
4,2,Firmicutes,433
5,2,Proteobacteria,0
6,2,Actinobacteria,754
7,2,Bacteroidetes,555


In [77]:
vals = data.value.copy()
vals[5] = 1000
vals

0     632
1    1638
2     569
3     115
4     433
5    1000
6     754
7     555
Name: value, dtype: object

In [78]:
data

Unnamed: 0,patient,phylum,value
0,1,Firmicutes,632
1,1,Proteobacteria,1638
2,1,Actinobacteria,569
3,1,Bacteroidetes,115
4,2,Firmicutes,433
5,2,Proteobacteria,0
6,2,Actinobacteria,754
7,2,Bacteroidetes,555
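A compact sketch of the safe pattern: take an explicit copy when the data should be modified independently, and assign through `.loc` on the frame itself when the parent should change. The small frame is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"value": [632, 1638, 569]})

# An explicit copy is independent of the parent DataFrame
vals = df["value"].copy()
vals[1] = 0
copy_values = vals.tolist()
parent_values = df["value"].tolist()   # parent is untouched

# To change the parent on purpose, assign on the frame itself
df.loc[1, "value"] = 1000

print(copy_values, parent_values, df["value"].tolist())
```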


We can create or modify columns by assignment:

In [79]:
data

Unnamed: 0,patient,phylum,value
0,1,Firmicutes,632
1,1,Proteobacteria,1638
2,1,Actinobacteria,569
3,1,Bacteroidetes,115
4,2,Firmicutes,433
5,2,Proteobacteria,0
6,2,Actinobacteria,754
7,2,Bacteroidetes,555


In [80]:
l = data[:]
l

Unnamed: 0,patient,phylum,value
0,1,Firmicutes,632
1,1,Proteobacteria,1638
2,1,Actinobacteria,569
3,1,Bacteroidetes,115
4,2,Firmicutes,433
5,2,Proteobacteria,0
6,2,Actinobacteria,754
7,2,Bacteroidetes,555


In [81]:
l.loc[2,'value']

569

In [83]:
l['value']==433

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
Name: value, dtype: bool

In [82]:
l.loc[l['value']==433,'value'] = 29
l.head()

Unnamed: 0,patient,phylum,value
0,1,Firmicutes,632
1,1,Proteobacteria,1638
2,1,Actinobacteria,569
3,1,Bacteroidetes,115
4,2,Firmicutes,29


In [81]:
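# .loc is label-based: there is no column labelled 2, so this adds a new
# column named 2 instead of writing to the third column
l.loc[l['value']==433, 2] = 29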
l.head()

Unnamed: 0,patient,phylum,value,2
0,1,Firmicutes,632,
1,1,Proteobacteria,1638,
2,1,Actinobacteria,569,
3,1,Bacteroidetes,115,
4,2,Firmicutes,29,


In [84]:
l.iloc[l['value']==29, 2] = 290
l.head()

Unnamed: 0,patient,phylum,value
0,1,Firmicutes,632
1,1,Proteobacteria,1638
2,1,Actinobacteria,569
3,1,Bacteroidetes,115
4,2,Firmicutes,290


In [85]:
data.head()

Unnamed: 0,patient,phylum,value
0,1,Firmicutes,632
1,1,Proteobacteria,1638
2,1,Actinobacteria,569
3,1,Bacteroidetes,115
4,2,Firmicutes,290


In [86]:
data.loc[data['value']==115,'value'] = 29
data

Unnamed: 0,patient,phylum,value
0,1,Firmicutes,632
1,1,Proteobacteria,1638
2,1,Actinobacteria,569
3,1,Bacteroidetes,29
4,2,Firmicutes,290
5,2,Proteobacteria,0
6,2,Actinobacteria,754
7,2,Bacteroidetes,555


In [87]:
l

Unnamed: 0,patient,phylum,value
0,1,Firmicutes,632
1,1,Proteobacteria,1638
2,1,Actinobacteria,569
3,1,Bacteroidetes,29
4,2,Firmicutes,290
5,2,Proteobacteria,0
6,2,Actinobacteria,754
7,2,Bacteroidetes,555


In [90]:
dd = data.copy()
dd.loc[dd['value']==632,'value'] = 600
dd.head()

Unnamed: 0,patient,phylum,value
0,1,Firmicutes,600
1,1,Proteobacteria,1638
2,1,Actinobacteria,569
3,1,Bacteroidetes,29
4,2,Firmicutes,290


In [91]:
data.head()

Unnamed: 0,patient,phylum,value
0,1,Firmicutes,632
1,1,Proteobacteria,1638
2,1,Actinobacteria,569
3,1,Bacteroidetes,29
4,2,Firmicutes,290


In [92]:
data.iloc[3, 2]=28
data

Unnamed: 0,patient,phylum,value
0,1,Firmicutes,632
1,1,Proteobacteria,1638
2,1,Actinobacteria,569
3,1,Bacteroidetes,28
4,2,Firmicutes,290
5,2,Proteobacteria,0
6,2,Actinobacteria,754
7,2,Bacteroidetes,555


In [93]:
data[data["value"]==1638]

Unnamed: 0,patient,phylum,value
1,1,Proteobacteria,1638


In [94]:
data[(data["value"]==569) & (data["patient"]==1)]

Unnamed: 0,patient,phylum,value
2,1,Actinobacteria,569
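Boolean filters like the ones above combine with `&` (and), `|` (or) and `~` (not). A short sketch of two common variations on a stand-in frame (the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "patient": [1, 1, 2, 2],
    "phylum": ["Firmicutes", "Proteobacteria", "Firmicutes", "Proteobacteria"],
    "value": [632, 1638, 433, 1130],
})

# Each comparison needs its own parentheses, because & and |
# bind more tightly than == and >
mask = (df["patient"] == 2) | (df["value"] > 1000)
either = df[mask]

# isin() replaces a chain of == comparisons joined with |
subset = df[df["value"].isin([632, 433])]

print(either)
print(subset)
```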


In [95]:
data.columns.get_loc('value')

2

In [96]:
data.iloc[2, data.columns.get_loc('value')]

569

In [92]:
# data.iloc[1, 2]=632
data[data["value"]==632]

Unnamed: 0,patient,phylum,value
0,1,Firmicutes,632


In [97]:
row_index = data.index.get_loc(data[data["value"]==555].index[0])
print(row_index)

7


In [98]:
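# Chained assignment: this can trigger SettingWithCopyWarning and has no effect
# when Copy-on-Write is enabled; data.loc[3, 'value'] = 14 is the reliable form
data['value'][3] = 14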
data

Unnamed: 0,patient,phylum,value
0,1,Firmicutes,632
1,1,Proteobacteria,1638
2,1,Actinobacteria,569
3,1,Bacteroidetes,14
4,2,Firmicutes,290
5,2,Proteobacteria,0
6,2,Actinobacteria,754
7,2,Bacteroidetes,555


In [95]:
data.loc[3, 'value'] = 30
data

Unnamed: 0,patient,phylum,value
0,1,Firmicutes,632
1,1,Proteobacteria,1638
2,1,Actinobacteria,569
3,1,Bacteroidetes,30
4,2,Firmicutes,290
5,2,Proteobacteria,0
6,2,Actinobacteria,754
7,2,Bacteroidetes,555


In [96]:
data.iloc[data["value"]==30, 2] = 40
data

Unnamed: 0,patient,phylum,value
0,1,Firmicutes,632
1,1,Proteobacteria,1638
2,1,Actinobacteria,569
3,1,Bacteroidetes,40
4,2,Firmicutes,290
5,2,Proteobacteria,0
6,2,Actinobacteria,754
7,2,Bacteroidetes,555


In [97]:
data.shape

(8, 3)

In [98]:
data['month'] = [2]*2 + [1]*2 + [4]*4
data['year'] = [2013]*4 + [2014]*4
data

Unnamed: 0,patient,phylum,value,month,year
0,1,Firmicutes,632,2,2013
1,1,Proteobacteria,1638,2,2013
2,1,Actinobacteria,569,1,2013
3,1,Bacteroidetes,40,1,2013
4,2,Firmicutes,290,4,2014
5,2,Proteobacteria,0,4,2014
6,2,Actinobacteria,754,4,2014
7,2,Bacteroidetes,555,4,2014


But note, we cannot use the attribute indexing method to add a new column; assigning to `data.treatment` only sets an ordinary Python attribute on the object:

In [99]:
data.treatment = 1
data

Unnamed: 0,patient,phylum,value
0,1,Firmicutes,632
1,1,Proteobacteria,1638
2,1,Actinobacteria,569
3,1,Bacteroidetes,14
4,2,Firmicutes,290
5,2,Proteobacteria,0
6,2,Actinobacteria,754
7,2,Bacteroidetes,555


In [100]:
data["treatment"] = data.treatment
data["treatment1"] = 0
data

Unnamed: 0,patient,phylum,value,treatment,treatment1
0,1,Firmicutes,632,1,0
1,1,Proteobacteria,1638,1,0
2,1,Actinobacteria,569,1,0
3,1,Bacteroidetes,14,1,0
4,2,Firmicutes,290,1,0
5,2,Proteobacteria,0,1,0
6,2,Actinobacteria,754,1,0
7,2,Bacteroidetes,555,1,0
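The distinction can be checked directly by looking at the column list before and after each kind of assignment. A minimal sketch on a two-row frame (names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"value": [632, 1638]})

# Attribute assignment with a new name sets a plain Python attribute
df.treatment = 1
has_column_after_attr = "treatment" in df.columns

# Item assignment creates a real column
df["treatment"] = 1
has_column_after_item = "treatment" in df.columns

print(has_column_after_attr, has_column_after_item)
```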


Specifying a `Series` as a new column causes its values to be aligned to the `DataFrame`'s index; rows with no matching label in the `Series` receive `NaN`:

In [101]:
tr = pd.Series([2]*4+[1]*3)
data["tr"] = tr

In [102]:
data

Unnamed: 0,patient,phylum,value,treatment,treatment1,tr
0,1,Firmicutes,632,1,0,2.0
1,1,Proteobacteria,1638,1,0,2.0
2,1,Actinobacteria,569,1,0,2.0
3,1,Bacteroidetes,14,1,0,2.0
4,2,Firmicutes,290,1,0,1.0
5,2,Proteobacteria,0,1,0,1.0
6,2,Actinobacteria,754,1,0,1.0
7,2,Bacteroidetes,555,1,0,


In [103]:
treatment = pd.Series([0]*4 + [1]*2)
treatment

0    0
1    0
2    0
3    0
4    1
5    1
dtype: int64

In [104]:
data['treatment'] = treatment
data

Unnamed: 0,patient,phylum,value,treatment,treatment1,tr
0,1,Firmicutes,632,0.0,0,2.0
1,1,Proteobacteria,1638,0.0,0,2.0
2,1,Actinobacteria,569,0.0,0,2.0
3,1,Bacteroidetes,14,0.0,0,2.0
4,2,Firmicutes,290,1.0,0,1.0
5,2,Proteobacteria,0,1.0,0,1.0
6,2,Actinobacteria,754,,0,1.0
7,2,Bacteroidetes,555,,0,
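With the default integer index, alignment looks the same as positional assignment. A sketch with string labels (chosen for illustration) makes the label matching visible:

```python
import pandas as pd

df = pd.DataFrame({"value": [632, 1638, 569]}, index=["a", "b", "c"])

# The Series is matched to rows by label, not by position:
# "c" comes first in the Series but still lands on row "c"
treatment = pd.Series([1, 0], index=["c", "a"])
df["treatment"] = treatment
print(df)

# Row "b" has no matching label, so it receives NaN and the
# column is stored as float64
```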


Other Python data structures (ones without an index) need to be the same length as the `DataFrame`:

In [105]:
month = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June', 'July', 'Aug']
data['month'] = month

In [106]:
len(data)

8

In [107]:
data['Quarter'] = ['all']*len(data)
data

Unnamed: 0,patient,phylum,value,treatment,treatment1,tr,month,Quarter
0,1,Firmicutes,632,0.0,0,2.0,Jan,all
1,1,Proteobacteria,1638,0.0,0,2.0,Feb,all
2,1,Actinobacteria,569,0.0,0,2.0,Mar,all
3,1,Bacteroidetes,14,0.0,0,2.0,Apr,all
4,2,Firmicutes,290,1.0,0,1.0,May,all
5,2,Proteobacteria,0,1.0,0,1.0,June,all
6,2,Actinobacteria,754,,0,1.0,July,all
7,2,Bacteroidetes,555,,0,,Aug,all


In [108]:
data['Quarter1'] = 'all'
data

Unnamed: 0,patient,phylum,value,treatment,treatment1,tr,month,Quarter,Quarter1
0,1,Firmicutes,632,0.0,0,2.0,Jan,all,all
1,1,Proteobacteria,1638,0.0,0,2.0,Feb,all,all
2,1,Actinobacteria,569,0.0,0,2.0,Mar,all,all
3,1,Bacteroidetes,14,0.0,0,2.0,Apr,all,all
4,2,Firmicutes,290,1.0,0,1.0,May,all,all
5,2,Proteobacteria,0,1.0,0,1.0,June,all,all
6,2,Actinobacteria,754,,0,1.0,July,all,all
7,2,Bacteroidetes,555,,0,,Aug,all,all
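A brief sketch of what happens when the lengths do not match, next to the scalar case, which is broadcast to every row (the frame and column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"value": [632, 1638, 569]})

try:
    df["month"] = ["Jan", "Feb"]   # two values for three rows
except ValueError as exc:
    print("ValueError:", exc)

# A scalar is the exception: it is repeated for every row
df["quarter"] = "all"
print(df)
```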


We can use `del` to remove columns, in the same way `dict` entries can be removed:

In [110]:
# Running this cell a second time raises KeyError, because the column has already been removed
del data['Quarter1']

KeyError: 'Quarter1'

In [111]:
data.head()

Unnamed: 0,patient,phylum,value,treatment,treatment1,tr,month,Quarter
0,1,Firmicutes,632,0.0,0,2.0,Jan,all
1,1,Proteobacteria,1638,0.0,0,2.0,Feb,all
2,1,Actinobacteria,569,0.0,0,2.0,Mar,all
3,1,Bacteroidetes,14,0.0,0,2.0,Apr,all
4,2,Firmicutes,290,1.0,0,1.0,May,all


In [112]:
data.index

Index([0, 1, 2, 3, 4, 5, 6, 7], dtype='int64')

In [117]:
a = data[data["phylum"]=='Firmicutes'].index
a

Index([0, 4], dtype='int64')

In [118]:
data[data["phylum"]=='Firmicutes']

Unnamed: 0,patient,phylum,value,treatment,treatment1,tr,month,Quarter
0,1,Firmicutes,632,0.0,0,2.0,Jan,all
4,2,Firmicutes,290,1.0,0,1.0,May,all


In [119]:
d = data.drop(a)
d.reset_index(drop = True, inplace = True)
d

Unnamed: 0,patient,phylum,value,treatment,treatment1,tr,month,Quarter
0,1,Proteobacteria,1638,0.0,0,2.0,Feb,all
1,1,Actinobacteria,569,0.0,0,2.0,Mar,all
2,1,Bacteroidetes,14,0.0,0,2.0,Apr,all
3,2,Proteobacteria,0,1.0,0,1.0,June,all
4,2,Actinobacteria,754,,0,1.0,July,all
5,2,Bacteroidetes,555,,0,,Aug,all


In [120]:
d.drop('tr', axis =1)
#d.drop('tr', axis =1, inplace=True)

Unnamed: 0,patient,phylum,value,treatment,treatment1,month,Quarter
0,1,Proteobacteria,1638,0.0,0,Feb,all
1,1,Actinobacteria,569,0.0,0,Mar,all
2,1,Bacteroidetes,14,0.0,0,Apr,all
3,2,Proteobacteria,0,1.0,0,June,all
4,2,Actinobacteria,754,,0,July,all
5,2,Bacteroidetes,555,,0,Aug,all


In [121]:
d

Unnamed: 0,patient,phylum,value,treatment,treatment1,tr,month,Quarter
0,1,Proteobacteria,1638,0.0,0,2.0,Feb,all
1,1,Actinobacteria,569,0.0,0,2.0,Mar,all
2,1,Bacteroidetes,14,0.0,0,2.0,Apr,all
3,2,Proteobacteria,0,1.0,0,1.0,June,all
4,2,Actinobacteria,754,,0,1.0,July,all
5,2,Bacteroidetes,555,,0,,Aug,all
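The ways of removing columns differ in whether they modify the frame and in how they treat missing labels. A minimal sketch on an illustrative three-column frame:

```python
import pandas as pd

df = pd.DataFrame({
    "value": [632, 1638],
    "tr": [2.0, 1.0],
    "month": ["Jan", "Feb"],
})

# drop() returns a new DataFrame; the original keeps all its columns
trimmed = df.drop(columns=["tr"])

# errors="ignore" skips absent labels instead of raising KeyError
trimmed = trimmed.drop(columns=["tr", "Quarter1"], errors="ignore")

# pop() removes a column in place and returns it as a Series
months = df.pop("month")

print(trimmed.columns.tolist(), df.columns.tolist())
```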


In [117]:
data

Unnamed: 0,patient,phylum,value,month,year,treatment,treatment1,tr,Quarter
0,1,Firmicutes,632,Jan,2013,0.0,0,2.0,all
1,1,Proteobacteria,1638,Feb,2013,0.0,0,2.0,all
2,1,Actinobacteria,569,Mar,2013,0.0,0,2.0,all
3,1,Bacteroidetes,40,Apr,2013,0.0,0,2.0,all
4,2,Firmicutes,290,May,2014,1.0,0,1.0,all
5,2,Proteobacteria,0,June,2014,1.0,0,1.0,all
6,2,Actinobacteria,754,July,2014,,0,1.0,all
7,2,Bacteroidetes,555,Aug,2014,,0,,all


In [122]:
data["phylum"].drop([0])

1    Proteobacteria
2    Actinobacteria
3     Bacteroidetes
4        Firmicutes
5    Proteobacteria
6    Actinobacteria
7     Bacteroidetes
Name: phylum, dtype: object

In [123]:
data1 = data.copy()
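# Pitfall: drop(..., inplace=True) returns None, so this assignment
# overwrites the whole column with None
data1["phylum"] = data1["phylum"].drop([0], inplace = True)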
data1

Unnamed: 0,patient,phylum,value,treatment,treatment1,tr,month,Quarter
0,1,,632,0.0,0,2.0,Jan,all
1,1,,1638,0.0,0,2.0,Feb,all
2,1,,569,0.0,0,2.0,Mar,all
3,1,,14,0.0,0,2.0,Apr,all
4,2,,290,1.0,0,1.0,May,all
5,2,,0,1.0,0,1.0,June,all
6,2,,754,,0,1.0,July,all
7,2,,555,,0,,Aug,all


In [124]:
data.loc[0, "phylum"] = None
data

Unnamed: 0,patient,phylum,value,treatment,treatment1,tr,month,Quarter
0,1,,632,0.0,0,2.0,Jan,all
1,1,Proteobacteria,1638,0.0,0,2.0,Feb,all
2,1,Actinobacteria,569,0.0,0,2.0,Mar,all
3,1,Bacteroidetes,14,0.0,0,2.0,Apr,all
4,2,Firmicutes,290,1.0,0,1.0,May,all
5,2,Proteobacteria,0,1.0,0,1.0,June,all
6,2,Actinobacteria,754,,0,1.0,July,all
7,2,Bacteroidetes,555,,0,,Aug,all


In [125]:
data["Jan"] = ["Jan"]*len(data)

In [126]:
data.head()

Unnamed: 0,patient,phylum,value,treatment,treatment1,tr,month,Quarter,Jan
0,1,,632,0.0,0,2.0,Jan,all,Jan
1,1,Proteobacteria,1638,0.0,0,2.0,Feb,all,Jan
2,1,Actinobacteria,569,0.0,0,2.0,Mar,all,Jan
3,1,Bacteroidetes,14,0.0,0,2.0,Apr,all,Jan
4,2,Firmicutes,290,1.0,0,1.0,May,all,Jan


In [127]:
del data['Jan']
data

Unnamed: 0,patient,phylum,value,treatment,treatment1,tr,month,Quarter
0,1,,632,0.0,0,2.0,Jan,all
1,1,Proteobacteria,1638,0.0,0,2.0,Feb,all
2,1,Actinobacteria,569,0.0,0,2.0,Mar,all
3,1,Bacteroidetes,14,0.0,0,2.0,Apr,all
4,2,Firmicutes,290,1.0,0,1.0,May,all
5,2,Proteobacteria,0,1.0,0,1.0,June,all
6,2,Actinobacteria,754,,0,1.0,July,all
7,2,Bacteroidetes,555,,0,,Aug,all


We can extract the underlying data as a simple `ndarray` by accessing the `values` attribute:

In [128]:
data.values

array([[1, None, 632, 0.0, 0, 2.0, 'Jan', 'all'],
       [1, 'Proteobacteria', 1638, 0.0, 0, 2.0, 'Feb', 'all'],
       [1, 'Actinobacteria', 569, 0.0, 0, 2.0, 'Mar', 'all'],
       [1, 'Bacteroidetes', 14, 0.0, 0, 2.0, 'Apr', 'all'],
       [2, 'Firmicutes', 290, 1.0, 0, 1.0, 'May', 'all'],
       [2, 'Proteobacteria', 0, 1.0, 0, 1.0, 'June', 'all'],
       [2, 'Actinobacteria', 754, nan, 0, 1.0, 'July', 'all'],
       [2, 'Bacteroidetes', 555, nan, 0, nan, 'Aug', 'all']], dtype=object)

In [129]:
data

Unnamed: 0,patient,phylum,value,treatment,treatment1,tr,month,Quarter
0,1,,632,0.0,0,2.0,Jan,all
1,1,Proteobacteria,1638,0.0,0,2.0,Feb,all
2,1,Actinobacteria,569,0.0,0,2.0,Mar,all
3,1,Bacteroidetes,14,0.0,0,2.0,Apr,all
4,2,Firmicutes,290,1.0,0,1.0,May,all
5,2,Proteobacteria,0,1.0,0,1.0,June,all
6,2,Actinobacteria,754,,0,1.0,July,all
7,2,Bacteroidetes,555,,0,,Aug,all


In [130]:
data.dtypes

patient        object
phylum         object
value          object
treatment     float64
treatment1      int64
tr            float64
month          object
Quarter        object
dtype: object

Notice that because of the mix of string and integer (and `NaN`) values, the dtype of the array is `object`. The dtype is automatically chosen to be as general as needed to accommodate all the columns.
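The "as general as needed" rule can be seen by converting different column subsets. This sketch uses `to_numpy()`, the method the Pandas documentation recommends over the `values` attribute, on an illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({
    "patient": [1, 2],
    "value": [632.5, 433.0],
    "phylum": ["Firmicutes", "Proteobacteria"],
})

# Integers and floats share a common numeric type
numeric = df[["patient", "value"]].to_numpy()
print(numeric.dtype)

# Adding the text column forces the most general dtype
mixed = df.to_numpy()
print(mixed.dtype)
```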

Pandas uses a custom data structure to represent the indices of Series and DataFrames.

In [131]:
data.index

Index([0, 1, 2, 3, 4, 5, 6, 7], dtype='int64')

Index objects are immutable:

In [132]:
data.index[0] = 15

TypeError: Index does not support mutable operations

In [136]:
data = data.set_index("month")

In [133]:
data

Unnamed: 0,patient,phylum,value,treatment,treatment1,tr,month,Quarter
0,1,,632,0.0,0,2.0,Jan,all
1,1,Proteobacteria,1638,0.0,0,2.0,Feb,all
2,1,Actinobacteria,569,0.0,0,2.0,Mar,all
3,1,Bacteroidetes,14,0.0,0,2.0,Apr,all
4,2,Firmicutes,290,1.0,0,1.0,May,all
5,2,Proteobacteria,0,1.0,0,1.0,June,all
6,2,Actinobacteria,754,,0,1.0,July,all
7,2,Bacteroidetes,555,,0,,Aug,all


In [134]:
data.reset_index()

Unnamed: 0,index,patient,phylum,value,treatment,treatment1,tr,month,Quarter
0,0,1,,632,0.0,0,2.0,Jan,all
1,1,1,Proteobacteria,1638,0.0,0,2.0,Feb,all
2,2,1,Actinobacteria,569,0.0,0,2.0,Mar,all
3,3,1,Bacteroidetes,14,0.0,0,2.0,Apr,all
4,4,2,Firmicutes,290,1.0,0,1.0,May,all
5,5,2,Proteobacteria,0,1.0,0,1.0,June,all
6,6,2,Actinobacteria,754,,0,1.0,July,all
7,7,2,Bacteroidetes,555,,0,,Aug,all


Index objects are immutable so that they can be shared between data structures without fear that they will be changed.
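A small sketch of that sharing: two Series are built on the same `Index`, in-place mutation is refused, and relabelling one Series leaves the other untouched. The labels are illustrative.

```python
import pandas as pd

idx = pd.Index(["Jan", "Feb", "Mar"])

# The same Index can back several structures
a = pd.Series([1, 2, 3], index=idx)
b = pd.Series([10, 20, 30], index=idx)

try:
    idx[0] = "Dec"                 # in-place mutation raises TypeError
except TypeError as exc:
    print("TypeError:", exc)

# Relabelling means assigning a whole new index to one object;
# the other Series keeps its original labels
a.index = ["x", "y", "z"]
print(b.index.tolist())
```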