In [1]:
import pandas as pd

In [12]:
titanic_excel = pd.read_excel("titanic.xls")

In [13]:
help(titanic_excel.to_csv)

Help on method to_csv in module pandas.core.generic:

to_csv(path_or_buf: 'FilePath | WriteBuffer[bytes] | WriteBuffer[str] | None' = None, sep: 'str' = ',', na_rep: 'str' = '', float_format: 'str | Callable | None' = None, columns: 'Sequence[Hashable] | None' = None, header: 'bool_t | list[str]' = True, index: 'bool_t' = True, index_label: 'IndexLabel | None' = None, mode: 'str' = 'w', encoding: 'str | None' = None, compression: 'CompressionOptions' = 'infer', quoting: 'int | None' = None, quotechar: 'str' = '"', lineterminator: 'str | None' = None, chunksize: 'int | None' = None, date_format: 'str | None' = None, doublequote: 'bool_t' = True, escapechar: 'str | None' = None, decimal: 'str' = '.', errors: 'OpenFileErrors' = 'strict', storage_options: 'StorageOptions | None' = None) -> 'str | None' method of pandas.core.frame.DataFrame instance
    Write object to a comma-separated values (csv) file.

    Parameters
    ----------
    path_or_buf : str, path object, file-like object, o

In [14]:
# Write the data to a CSV file,
# creating the file if it does not exist.

titanic_excel.to_csv("titanic.csv", index=False)

# Assumption: later cells use `titanic`, which this notebook never defines.
titanic = pd.read_csv("titanic.csv")
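As the signature printed by `help` suggests, `to_csv` returns the CSV text when `path_or_buf` is left as `None` and returns `None` when it is given somewhere to write. The following is a minimal sketch of both behaviours, using a small made-up table (`toy` and its values are hypothetical) and an in-memory buffer instead of a file on disk:

```python
import io

import pandas as pd

# Hypothetical stand-in for the Titanic table; values are illustrative only.
toy = pd.DataFrame({"age": [29.0, 2.0], "sex": ["female", "male"]})

# No path given: the CSV text comes back as a string.
csv_text = toy.to_csv(index=False)
print(csv_text.splitlines())  # ['age,sex', '29.0,female', '2.0,male']

# A buffer (or a path) given: pandas writes to it and returns None.
buffer = io.StringIO()
result = toy.to_csv(buffer, index=False)
print(result)  # None

# Reading the text back recovers a table of the same shape.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip.shape)  # (2, 2)
```

Passing `index=False` keeps the row labels out of the file, so reading it back does not produce an extra unnamed column.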

# How do I select a subset of a `DataFrame`?

### How do I select specific columns from a `DataFrame`?

#### I'm interested in the age of the Titanic passengers.

In [15]:
ages = titanic_excel['age']

In [16]:
ages

0       29.0000
1        0.9167
2        2.0000
3       30.0000
4       25.0000
         ...   
1304    14.5000
1305        NaN
1306    26.5000
1307    27.0000
1308    29.0000
Name: age, Length: 1309, dtype: float64

In [17]:
ages.head()

0    29.0000
1     0.9167
2     2.0000
3    30.0000
4    25.0000
Name: age, dtype: float64

To select a single column, use square brackets `[]` with the name of the column of interest.

Each column in a `DataFrame` is a `Series`. Because a single column is selected, the returned object is a pandas `Series`. We can verify this by checking the type of the output:

In [19]:
type(titanic["age"])

pandas.core.series.Series

To check the shape of the output:

In [20]:
titanic['age'].shape

(1309,)

A pandas `Series` is 1-dimensional, so only the number of rows is returned.
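The same check can be reproduced without the Titanic file. The sketch below uses a small hypothetical table (the name `toy` and its values are made up for illustration) and confirms that single-bracket selection gives a 1-dimensional `Series`:

```python
import pandas as pd

# Hypothetical stand-in for the Titanic table; values are illustrative only.
toy = pd.DataFrame(
    {"age": [29.0, 0.9167, 2.0], "sex": ["female", "male", "female"]}
)

toy_ages = toy["age"]            # single brackets with one column name
print(type(toy_ages).__name__)   # Series
print(toy_ages.shape)            # (3,)
print(toy_ages.ndim)             # 1
```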

#### To get the *age* and *sex* of the Titanic passengers:

In [22]:
age_sex = titanic[["age", "sex"]]
age_sex.head()

Unnamed: 0,age,sex
0,29.0,female
1,0.9167,male
2,2.0,female
3,30.0,male
4,25.0,female


To select multiple columns, use a list of column names within the selection brackets `[]`.

The returned data type is a pandas DataFrame:

In [23]:
type(titanic[["age", "sex"]])

pandas.core.frame.DataFrame

In [24]:
titanic[["age", "sex"]].shape

(1309, 2)

The selection returned a `DataFrame` with 1309 rows and 2 columns. A `DataFrame` is 2-dimensional with both a row and column dimension.
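One detail worth seeing side by side is that the inner list, not the number of columns, decides the return type. In this sketch (the `toy` table is hypothetical), a one-element list still yields a `DataFrame`, while a bare column name yields a `Series`:

```python
import pandas as pd

# Hypothetical stand-in for the Titanic table; values are illustrative only.
toy = pd.DataFrame(
    {"age": [29.0, 2.0], "sex": ["female", "female"], "pclass": [1, 1]}
)

two_cols = toy[["age", "sex"]]   # list of names -> DataFrame
one_col_df = toy[["age"]]        # one-element list -> still a DataFrame
one_col_series = toy["age"]      # bare name -> Series

print(two_cols.shape)            # (2, 2)
print(one_col_df.shape)          # (2, 1)
print(one_col_series.shape)      # (2,)
```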

### How do I filter specific rows from a `DataFrame`?

#### To get the passengers older than 35 years.

In [25]:
above_35 = titanic[titanic['age']>35]

In [27]:
above_35.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
5,1,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.55,E12,S,3,,"New York, NY"
6,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583,D7,S,10,,"Hudson, NY"
7,1,0,"Andrews, Mr. Thomas Jr",male,39.0,0,0,112050,0.0,A36,S,,,"Belfast, NI"
8,1,1,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53.0,2,0,11769,51.4792,C101,S,D,,"Bayside, Queens, NY"
9,1,0,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"


To select rows based on a conditional expression, use a condition inside the selection brackets `[]`.

The condition inside the selection brackets `titanic['age'] > 35` checks for which rows the `age` column has a value larger than 35:

In [28]:
titanic['age'] > 35

0       False
1       False
2       False
3       False
4       False
        ...  
1304    False
1305    False
1306    False
1307    False
1308    False
Name: age, Length: 1309, dtype: bool

The output of the conditional expression (`>`, but `==`, `!=`, `<`, `<=`, ... would work as well) is a pandas `Series` of boolean values (either `True` or `False`) with the same number of rows as the original `DataFrame`. Such a `Series` of boolean values can be used to filter the `DataFrame` by putting it between the selection brackets `[]`. Only rows for which the value is `True` are selected.

Let's have a look at the number of rows which satisfy the condition by checking the `shape` attribute of the resulting `DataFrame` `above_35`:

In [29]:
above_35.shape

(322, 14)
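To make the mask mechanics concrete, here is a small sketch on a made-up table (`toy` and its values are hypothetical). It also shows that a missing age compares as `False`, so rows with unknown age never pass a `>` filter, and that summing the mask counts the matching rows:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic table; one age is missing on purpose.
toy = pd.DataFrame(
    {"name": ["a", "b", "c", "d"], "age": [48.0, 29.0, np.nan, 63.0]}
)

mask = toy["age"] > 35           # boolean Series, one entry per row
print(mask.tolist())             # [True, False, False, True]
print(int(mask.sum()))           # 2 (True counts as 1)

older = toy[mask]                # keep only rows where the mask is True
print(older.shape)               # (2, 2)
print(older.index.tolist())      # [0, 3] (original row labels are kept)
```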

#### Titanic passengers from cabin class 2 and 3.

In [31]:
class_23 = titanic[titanic['pclass'].isin([2, 3])]

In [33]:
class_23.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
323,2,0,"Abelson, Mr. Samuel",male,30.0,1,0,P/PP 3381,24.0,,C,,,"Russia New York, NY"
324,2,1,"Abelson, Mrs. Samuel (Hannah Wizosky)",female,28.0,1,0,P/PP 3381,24.0,,C,10.0,,"Russia New York, NY"
325,2,0,"Aldworth, Mr. Charles Augustus",male,30.0,0,0,248744,13.0,,S,,,"Bryn Mawr, PA, USA"
326,2,0,"Andrew, Mr. Edgardo Samuel",male,18.0,0,0,231945,11.5,,S,,,"Buenos Aires, Argentina / New Jersey, NJ"
327,2,0,"Andrew, Mr. Frank Thomas",male,25.0,0,0,C.A. 34050,10.5,,S,,,"Cornwall, England Houghton, MI"


Similar to the conditional expression, the `isin()` conditional function returns `True` for each row whose value is in the provided list.

To filter the rows based on such a function, use the conditional function inside the selection brackets `[]`.

In this case, the condition inside the selection brackets `titanic['pclass'].isin([2,3])` checks for which rows the `pclass` column is either 2 or 3.

The above is equivalent to filtering by rows for which the class is either 2 or 3 and combining the two statements with an `|` (or) operator:

In [36]:
class_23 = titanic[
    (titanic['pclass'] == 2) | (titanic['pclass'] == 3)
]

In [37]:
class_23.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
323,2,0,"Abelson, Mr. Samuel",male,30.0,1,0,P/PP 3381,24.0,,C,,,"Russia New York, NY"
324,2,1,"Abelson, Mrs. Samuel (Hannah Wizosky)",female,28.0,1,0,P/PP 3381,24.0,,C,10.0,,"Russia New York, NY"
325,2,0,"Aldworth, Mr. Charles Augustus",male,30.0,0,0,248744,13.0,,S,,,"Bryn Mawr, PA, USA"
326,2,0,"Andrew, Mr. Edgardo Samuel",male,18.0,0,0,231945,11.5,,S,,,"Buenos Aires, Argentina / New Jersey, NJ"
327,2,0,"Andrew, Mr. Frank Thomas",male,25.0,0,0,C.A. 34050,10.5,,S,,,"Cornwall, England Houghton, MI"


Note

------------

When combining multiple conditional statements, each condition must be surrounded by parentheses `()`. Moreover, you cannot use Python's `or`/`and` keywords; use the element-wise operators `|` (or) and `&` (and) instead.
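The note can be checked on a small hypothetical table (`toy` and its values are made up). The sketch confirms that `isin` and the `|` form select the same rows, shows an `&` combination, and demonstrates that the plain `or` keyword raises a `ValueError` because it asks for a single truth value of a whole `Series`:

```python
import pandas as pd

# Hypothetical stand-in for the Titanic table; values are illustrative only.
toy = pd.DataFrame({"pclass": [1, 2, 3, 2], "age": [40.0, 30.0, 50.0, 20.0]})

via_isin = toy[toy["pclass"].isin([2, 3])]
via_or = toy[(toy["pclass"] == 2) | (toy["pclass"] == 3)]
print(via_isin.equals(via_or))   # True

# Both conditions must hold; each one sits in its own parentheses.
both = toy[(toy["pclass"] > 1) & (toy["age"] > 25)]
print(both.index.tolist())       # [1, 2]

# The `or` keyword needs one boolean per operand, which a Series cannot give.
try:
    (toy["pclass"] == 2) or (toy["pclass"] == 3)
    raised = False
except ValueError:
    raised = True
print(raised)                    # True
```

The parentheses matter because `|` and `&` bind more tightly than comparison operators such as `==` and `>`.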

#### To work with passenger data for which the age is known.

In [38]:
age_no_na = titanic[titanic['age'].notna()]

In [39]:
age_no_na.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


The `notna()` conditional function returns `True` for each row whose value is not null (that is, not missing). As such, it can be combined with the selection brackets `[]` to filter the data table.

To check the shape of the filtered data:

In [42]:
age_no_na.shape

(1046, 14)
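Comparing this shape with the 1309 rows reported earlier implies 1309 - 1046 = 263 passengers with an unknown age. The sketch below repeats the idea on a made-up table (`toy` is hypothetical), counts the missing values with `isna()`, and shows that `dropna(subset=...)` is an alternative spelling of the same filter:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic table; one age is missing on purpose.
toy = pd.DataFrame({"name": ["a", "b", "c"], "age": [29.0, np.nan, 26.5]})

known_age = toy[toy["age"].notna()]       # keep rows whose age is present
print(known_age.shape)                    # (2, 2)

missing = int(toy["age"].isna().sum())    # number of rows without an age
print(missing)                            # 1

# dropna restricted to the `age` column removes the same rows.
same = known_age.equals(toy.dropna(subset=["age"]))
print(same)                               # True
```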

### How do I select specific rows and columns from a `DataFrame`?

#### I'm interested in the names of the passengers older than 35 years.

In [43]:
adult_names = titanic.loc[titanic["age"]>35,"name"]

In [45]:
adult_names.head()

5                              Anderson, Mr. Harry
6                Andrews, Miss. Kornelia Theodosia
7                           Andrews, Mr. Thomas Jr
8    Appleton, Mrs. Edward Dale (Charlotte Lamson)
9                          Artagaveytia, Mr. Ramon
Name: name, dtype: object

In this case, a subset of both rows and columns is made in one go and just using selection brackets `[]` is not sufficient anymore.

The `loc`/`iloc` operators are required in front of the selection brackets `[]`.

When using `loc`/`iloc`, the part before the comma is the rows you want, and the part after the comma is the columns you want to select.

When using the column names, row labels or a condition expression, use the `loc` operator in front of the selection brackets `[]`. For both the part before and after the comma, you can use a single label, a list of labels, a slice of labels, a conditional expression or a colon. Using a colon specifies you want to select all rows or columns.
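The options listed for `loc` can be tried out on a small hypothetical table (`toy`, its columns, and its values are made up for illustration). Note that label slices in `loc` include the end label, unlike ordinary Python slices:

```python
import pandas as pd

# Hypothetical stand-in for the Titanic table; values are illustrative only.
toy = pd.DataFrame(
    {
        "name": ["a", "b", "c", "d"],
        "sex": ["m", "f", "f", "m"],
        "age": [48.0, 29.0, 63.0, 39.0],
    }
)

# Condition for the rows, single label for the columns -> Series.
names_over_35 = toy.loc[toy["age"] > 35, "name"]
print(names_over_35.tolist())        # ['a', 'c', 'd']

# Condition for the rows, list of labels for the columns -> DataFrame.
subset = toy.loc[toy["age"] > 35, ["name", "age"]]
print(subset.shape)                  # (3, 2)

# Colon selects all rows; the label slice includes both "sex" and "age".
all_rows = toy.loc[:, "sex":"age"]
print(all_rows.columns.tolist())     # ['sex', 'age']

# A label slice on the rows includes the end label 2 as well.
rows_1_to_2 = toy.loc[1:2, "name"]
print(rows_1_to_2.tolist())          # ['b', 'c']
```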

#### I'm interested in rows 10 through 25 and columns 3 through 5.

In [46]:
titanic.iloc[9:25, 2:5]

Unnamed: 0,name,sex,age
9,"Artagaveytia, Mr. Ramon",male,71.0
10,"Astor, Col. John Jacob",male,47.0
11,"Astor, Mrs. John Jacob (Madeleine Talmadge Force)",female,18.0
12,"Aubart, Mme. Leontine Pauline",female,24.0
13,"Barber, Miss. Ellen ""Nellie""",female,26.0
14,"Barkworth, Mr. Algernon Henry Wilson",male,80.0
15,"Baumann, Mr. John D",male,
16,"Baxter, Mr. Quigg Edmond",male,24.0
17,"Baxter, Mrs. James (Helene DeLaudeniere Chaput)",female,50.0
18,"Bazzani, Miss. Albina",female,32.0


Again, a subset of both rows and columns is made in one go and just using selection brackets `[]` is not sufficient anymore. 

When specifically interested in certain rows and/or columns based on their position in the table, use the `iloc` operator in front of the selection brackets `[]`.
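Positions in `iloc` start at 0 and slices exclude the end position, which is why rows 10 through 25 are written as `9:25` and columns 3 through 5 as `2:5`. A minimal sketch on a made-up 30 x 6 table (the `toy` name and the column names `c0` to `c5` are hypothetical) confirms the resulting size:

```python
import numpy as np
import pandas as pd

# Hypothetical 30 x 6 table; column names c0..c5 are made up.
toy = pd.DataFrame(
    np.arange(180).reshape(30, 6),
    columns=[f"c{i}" for i in range(6)],
)

block = toy.iloc[9:25, 2:5]
print(block.shape)                        # (16, 3)
print(block.index[0], block.index[-1])    # 9 24
print(block.columns.tolist())             # ['c2', 'c3', 'c4']
```

Positions 9 to 24 are the 10th to 25th rows, so the selection holds 16 rows and 3 columns.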

When selecting specific rows and/or columns with `loc` or `iloc`, new values can be assigned to the selected data. For example, to assign the value `anonymous` to the first 3 elements of the fourth column (position 3, which is `sex` in this table; the `name` column sits at position 2):

In [47]:
titanic.iloc[0:3, 3] = "anonymous"

In [48]:
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",anonymous,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",anonymous,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",anonymous,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
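In the output above the value landed in the `sex` column, because position 3 is the fourth column of this table. One way to avoid counting columns by hand is to look the position up by name with `Index.get_loc`. The sketch below does this on a small made-up table (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in with the same leading column order as the table above.
toy = pd.DataFrame(
    {
        "pclass": [1, 1, 1, 1],
        "survived": [1, 1, 0, 0],
        "name": ["a", "b", "c", "d"],
        "sex": ["f", "m", "f", "m"],
    }
)

# Look up the integer position of the column instead of hard-coding it.
name_pos = toy.columns.get_loc("name")
print(name_pos)                  # 2

# Assign to the first three rows of that column by position.
toy.iloc[0:3, name_pos] = "anonymous"
print(toy["name"].tolist())      # ['anonymous', 'anonymous', 'anonymous', 'd']
print(toy["sex"].tolist())       # ['f', 'm', 'f', 'm']
```

With the default integer row labels, `toy.loc[0:2, "name"] = "anonymous"` would be a label-based equivalent, since `loc` slices include the end label.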
