# <center> Introduction to Data Analysis. Pandas</center>

### Different types of data files to work with

A file format is a standard way of encoding information for storage in a file. First, the format determines whether the file is binary or plain text (for example, ASCII). Second, it defines how the information is organized.

The most popular data file formats:

* **CSV** (Comma-Separated Values). The simplest and best-supported file type for tabular data. It is usually more compact than JSON or XML for the same table, because the column names appear only once, in the header row.
```python
id,type,quantity
0,bananas,12
1,apples,7
```
* **JSON** (JavaScript Object Notation). It's commonly used for storing and exchanging data (for example, in APIs). While CSV is the most common file format for “flat” data, JSON is the most common file format for “tree-like” data that potentially has multiple layers, like the branches on a tree:
```python
[{"id": 0, "type": "bananas", "quantity": 12}, {"id": 1, "type": "apples", "quantity": 7}]
```
* **XLSX**. The Microsoft Excel Open XML format, a spreadsheet file format.
* **HTML** (e.g., for web pages)
* **ZIP** (compressed archives). It's used to bundle multiple files into a single file and to compress them so that they take less storage space.
* **PDF** (Portable Document Format). 
* **TXT** (plain text files)
* Images, audio and video files in various formats

etc. [More info](https://www.analyticsvidhya.com/blog/2017/03/read-commonly-used-formats-using-python/)

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2017/03/01103613/CSV.png'>

To identify a file format, you can usually look at the file extension, for example the ".csv" in "data.csv". Choosing a suitable file format for storing data can reduce file size and speed up reading and processing.
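
As a minimal sketch of the extension check described above, Python's standard `pathlib` module can return the suffix of a file name. The file names below are hypothetical, and an extension is only a hint: it does not guarantee what the file actually contains.

```python
from pathlib import Path

# Hypothetical file names, used only for illustration.
file_names = ['data.csv', 'report.xlsx', 'archive.zip', 'notes']

# Path.suffix returns the last extension (including the dot), or '' if there is none.
extensions = {name: Path(name).suffix.lower() for name in file_names}
print(extensions)
```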

### Reading and writing a file using 'open'

In [32]:
# create a new empty file ('x' raises FileExistsError if the file already exists)
f = open('new_file', 'x')
f.close()

In [33]:
# open for writing: creates the file if it doesn't exist, empties it if it does
f = open('new_file', 'w')
f.close()

In [34]:
# write to a file (write() adds no newlines, so the three strings end up on one line)
f = open('new_file', 'w')
f.write('Let it be the first line.')
f.write('This is the second line.')
f.write('The end.')
f.close()

In [35]:
# read from a file
f = open('new_file', 'r')
f.read()

'Let it be the first line.This is the second line.The end.'

In [36]:
# read the first 3 characters from a file
f = open('new_file', 'r')
f.read(3)

'Let'

In [37]:
# read one line from a file (the file has no newlines, so this returns everything)
f = open('new_file', 'r')
f.readline()

'Let it be the first line.This is the second line.The end.'

In [38]:
f.readline()

''

In [39]:
f.close()

In [40]:
# append to the end of an existing file
f = open('new_file', 'a')
f.write('Now it contains more content.')
f.close()
f = open('new_file', 'r')
f.read()

'Let it be the first line.This is the second line.The end.Now it contains more content.'

A more common approach is a `with` block, which closes the file automatically, so the explicit close() step can be dropped:

In [41]:
with open('some_file.txt', 'w') as f:
    f.write('I want \n to test \n this text.')  

In [48]:
with open('some_file.txt', 'r') as f:
    output = f.read()
    
output

'I want \n to test \n this text.'

**Problem!** What if the file you need is in a different directory (folder)?
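
One way to handle this, sketched with the standard `pathlib` module, is to build the path to the file explicitly instead of passing a bare file name. The folder name `data_dir` and the file name `notes.txt` are made up for the example.

```python
from pathlib import Path

# Hypothetical folder and file names, used only for illustration.
folder = Path('data_dir')
folder.mkdir(exist_ok=True)        # create the folder if it is missing
file_path = folder / 'notes.txt'   # the / operator joins path parts portably

with open(file_path, 'w') as f:
    f.write('Stored in a subfolder.')

with open(file_path, 'r') as f:
    content = f.read()

print(file_path.resolve())  # absolute path of the file on this machine
print(content)
```

A relative path like this one is resolved against the current working directory, while the absolute path printed by `resolve()` works from any location.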

### Introduction to pandas library

<img src="https://welovepandas.club/wp-content/uploads/2019/02/panda-bamboo1550035127.jpg" height=350 width=400>

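pandas is an open-source Python library for data analysis and manipulation. Its name comes from "panel data" and is also a play on "Python data analysis".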

In [50]:
import pandas as pd

Sometimes you can open the file without any additional arguments.

In [81]:
df = pd.read_csv('happiness_index_2019.csv')
df.head()

| | Overall rank | Country or region | Score | GDP per capita | Social support | Healthy life expectancy | Freedom to make life choices | Generosity | Perceptions of corruption |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Finland | 7.769 | 1.34 | 1.587 | 0.986 | 0.596 | 0.153 | 0.393 |
| 1 | 2 | Denmark | 7.6 | 1.383 | 1.573 | 0.996 | 0.592 | 0.252 | 0.41 |
| 2 | 3 | Norway | 7.554 | 1.488 | 1.582 | 1.028 | 0.603 | 0.271 | 0.341 |
| 3 | 4 | Iceland | 7.494 | 1.38 | 1.624 | 1.026 | 0.591 | 0.354 | 0.118 |
| 4 | 5 | Netherlands | 7.488 | 1.396 | 1.522 | 0.999 | 0.557 | 0.322 | 0.298 |


Sometimes it's trickier, and the file only loads correctly with extra arguments such as `encoding`.

This data set is too big for GitHub, so download it from [here](https://www.kaggle.com/START-UMD/gtd). You will need to register on Kaggle first.

In [132]:
df = pd.read_csv('globalterrorismdb_0718dist.csv', encoding='ISO-8859-1')
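
A small sketch of why the `encoding` argument can matter. `read_csv` assumes UTF-8 by default, so a file saved in another encoding, such as ISO-8859-1 (Latin-1), may fail to load with a `UnicodeDecodeError`. The file name and its contents below are invented for the demonstration.

```python
import pandas as pd

# Hypothetical file written in Latin-1 to reproduce the situation.
with open('latin1_example.csv', 'w', encoding='ISO-8859-1') as f:
    f.write('city,country\nMálaga,Spain\nGöteborg,Sweden\n')

try:
    pd.read_csv('latin1_example.csv')  # default encoding is UTF-8
    default_ok = True
except UnicodeDecodeError:
    default_ok = False

# Passing the matching encoding decodes the accented characters correctly.
cities = pd.read_csv('latin1_example.csv', encoding='ISO-8859-1')
print(default_ok)
print(cities)
```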



The two core pandas data structures are the Series (a one-dimensional labeled array) and the DataFrame (a two-dimensional labeled table).

In [68]:
pd.Series([5, 6, 7, 8, 9, 10])

0     5
1     6
2     7
3     8
4     9
5    10
dtype: int64

In [71]:
pd.DataFrame([1, 2, 3, 4, 5])

| | 0 |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |


In [73]:
pd.DataFrame({'Student': ['1', '2'], 'Name': ['Alice', 'Michael'], 'Surname': ['Brown', 'Williams']})

| | Student | Name | Surname |
|---|---|---|---|
| 0 | 1 | Alice | Brown |
| 1 | 2 | Michael | Williams |


In [74]:
pd.DataFrame([{'Student': '1', 'Name': 'Alice', 'Surname': 'Brown'}, 
            {'Student': '2', 'Name': 'Anna', 'Surname': 'White'}])

| | Student | Name | Surname |
|---|---|---|---|
| 0 | 1 | Alice | Brown |
| 1 | 2 | Anna | White |


A DataFrame can also be created with these alternative constructors:
* pd.DataFrame.from_records()
* pd.DataFrame.from_dict()
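
A brief sketch of the two constructors, reusing the student data from the cells above. The row labels `row_a` and `row_b` are made up to show the `orient='index'` option.

```python
import pandas as pd

# from_records: each record (here a tuple) becomes one row.
records = [('1', 'Alice', 'Brown'), ('2', 'Anna', 'White')]
students_from_records = pd.DataFrame.from_records(
    records, columns=['Student', 'Name', 'Surname']
)

# from_dict with the default orient='columns': each key becomes a column.
students_from_dict = pd.DataFrame.from_dict(
    {'Student': ['1', '2'], 'Name': ['Alice', 'Anna'], 'Surname': ['Brown', 'White']}
)

# from_dict with orient='index': each key becomes a row label.
students_by_row = pd.DataFrame.from_dict(
    {'row_a': {'Name': 'Alice', 'Surname': 'Brown'},
     'row_b': {'Name': 'Anna', 'Surname': 'White'}},
    orient='index',
)

print(students_from_records.equals(students_from_dict))
print(students_by_row)
```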

Let's explore the second data set (the one loaded from 'globalterrorismdb_0718dist.csv'). How many rows and columns are there?

General information on this data set:

How to look only at the column names?

How to look at the first 10 rows?

How to look at the last 15 rows?

Which data types do we have in each column?
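
The inspection questions above can be sketched on a small made-up frame. The name `toy` and its columns are placeholders, not part of the real data set, and the same attributes and methods should apply to `df`.

```python
import pandas as pd

# Small made-up frame standing in for a real data set.
toy = pd.DataFrame({
    'Year': list(range(2000, 2020)),
    'Country': ['A', 'B'] * 10,
    'Score': [x / 2 for x in range(20)],
})

print(toy.shape)            # (number of rows, number of columns)
toy.info()                  # column names, non-null counts, dtypes, memory usage
print(list(toy.columns))    # column names only
first_rows = toy.head(10)   # first 10 rows
last_rows = toy.tail(15)    # last 15 rows
print(toy.dtypes)           # data type of each column
```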

Delete the ```eventid``` column from this data set, because we have no description of it:

In [95]:
df.drop(['eventid'], axis=1, inplace=True)

Rename the ```iyear``` column to ```Year```:

In [97]:
df.rename({'iyear' : 'Year'}, axis='columns', inplace=True)

How to check for missing values?
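
One possible check, sketched on made-up data: `isna()` marks each missing cell, and summing the result counts the gaps per column. The frame `toy` and its column names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up data with a few gaps.
toy = pd.DataFrame({
    'city': ['Oslo', None, 'Lima'],
    'score': [1.0, np.nan, np.nan],
    'year': [1999, 2005, 2010],
})

missing_per_column = toy.isna().sum()   # number of missing values in each column
any_missing = toy.isna().any().any()    # True if at least one value is missing
print(missing_per_column)
print(any_missing)
```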

How to drop all rows that contain missing values?

In [None]:
df.dropna(inplace=True)

Use a function to replace NaNs with the word 'None' in the 'approxdate' column.
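
A possible approach, shown on a made-up frame, is `Series.fillna`. Only the column name `approxdate` comes from the text; the values, the `toy` frame, and the choice of `fillna` as the intended function are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real data; only the column name is taken from the text.
toy = pd.DataFrame({
    'approxdate': ['March 1970', np.nan, np.nan],
    'Year': [1970, 1971, 1972],
})

# fillna replaces missing values; assigning the result back avoids inplace=True.
toy['approxdate'] = toy['approxdate'].fillna('None')
print(toy)
```

Note that the replacement here is the string 'None', not Python's `None` object, so the column no longer counts as having missing values.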