# Working with Text Files

## Introduction to Files

In Windows, macOS, or any other operating system, a `file` is any item that can be created, modified, stored, or deleted by the user or the operating system. This includes image, sound, and video files as well as workbooks, text documents, and many more.

In contrast to operating systems, Python distinguishes only two categories of files:
### Text Files
- A text file is a sequence of lines of so-called `electronic text`. Each line is in turn a sequence of characters.\
  Here, the term `text` covers strings, numbers, and symbols alike.
- If a text file contains multiple lines, they are normally terminated with a special character called the `End of Line character`, or `EOL`.
- Text files are indispensable because of the ease and speed with which they transfer data to and from a server.
- They don't contain metadata, that is, information about the data stored in them. They contain only the data itself, as text.
- In the literature and in online searches, you will often see text files referred to as `Flat Files` or `ASCII Files` (*American Standard Code for Information Interchange*).

### Binary Files
- This type of file stores its data as raw bytes, that is, sequences of `1`s and `0`s that are not meant to be read as text.
- Because of this, its contents can only be accessed and processed by an application that understands the file's structure.
- Contrary to text files, binary files are not organised into lines ended by line terminators. The information is stored as one continuous sequence of bytes, regardless of its length.
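
The difference between the two categories shows up in how Python opens a file. The sketch below is only an illustration: the file name and its contents are made up, and the file is written to a temporary folder. It opens the same file once in text mode and once in binary mode.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")  # hypothetical file name

    # Create a small two-line text file; newline="\n" keeps the EOL as-is.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("first line\nsecond line\n")

    # Text mode: the bytes are decoded into str and can be split into lines.
    with open(path, "r", encoding="utf-8") as f:
        text_lines = f.read().splitlines()

    # Binary mode: the same file comes back as raw bytes, with no decoding.
    with open(path, "rb") as f:
        raw_bytes = f.read()

print(text_lines)
print(raw_bytes)
```

In text mode Python hands back `str` values, while binary mode returns a `bytes` object in which the line endings are just ordinary bytes.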

### Text Files Vs Text Data
**Text Files** are a certain type of file.\
**Text Data**, on the other hand, is data or information that represents 'text', as opposed to numbers or Boolean values.

### How has Python gained such a great reputation for processing image files?
This is possible once image files have been transformed into 'text data' that contains the values along each dimension of the RGB format. Images can then be stored as colours in number form.
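
As a rough illustration of that idea, a tiny image can be written down as nothing more than numbers. The 2x2 "image" below is invented for this sketch; each pixel is a (red, green, blue) triple with values from 0 to 255.

```python
# A hypothetical 2x2 image: each pixel is an (R, G, B) triple.
image = [
    [(255, 0, 0), (0, 255, 0)],      # red pixel, green pixel
    [(0, 0, 255), (255, 255, 255)],  # blue pixel, white pixel
]

height = len(image)
width = len(image[0])

# Pull out a single colour dimension: the red value of every pixel.
red_channel = [[pixel[0] for pixel in row] for row in image]

print(height, width)
print(red_channel)
```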

## File vs File Object

### File
- A file is an item that can be created, modified, and stored by a user or an operating system.
- It can be a **text file** or a **binary file**. (*We will be focusing on `text files`*)

### File object

- A file object is a Python object that represents a file and gives access to the data imported from it.
- In general, files are stored on our computer. To work with them in our programs, we first have to import them and convert them into `file objects`.
- Once the information has been stored as a file object, we can modify that object as we like, depending on the circumstances.\
(*We will import `text files` into Python as `file objects` and then manipulate these objects while cleaning, preprocessing, and analyzing data.*)
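
A minimal sketch of going from a file to a file object with the built-in `open()` function is shown below. The file name and contents are assumptions made for the example, and the file is created in a temporary folder first so that the snippet runs on its own.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "notes.txt")  # hypothetical file name

    # Create the file on disk first so there is something to open.
    with open(path, "w", encoding="utf-8") as f:
        f.write("alpha\nbeta\ngamma\n")

    # open() returns a file object; the with-block closes it automatically.
    with open(path, "r", encoding="utf-8") as file_obj:
        object_type = type(file_obj).__name__
        # Iterating over a text file object yields one line at a time.
        lines = [line.rstrip("\n") for line in file_obj]

    is_closed = file_obj.closed

print(object_type, lines, is_closed)
```

Using `with` is a common choice because the file is closed even if an error occurs while it is being processed.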

## Read vs Parse

### Reading
- The goal of reading a file is simply transferring its text (content) into the computer's memory.
- When working in Python, this will primarily be your Working Memory or `RAM`.

Once *reading* the file has been completed, your data is stored in a Python `File Object`. Then we *parse* it.

### Parsing
- Parsing is about working out the structure and meaning of the data held in the Python object.
- It is also called *Syntactic Analysis*.

A classic real-life analogy for these two terms is this:
- Photocopying an image with a machine can be considered *`reading`* that image.
- A person who then looks at the photocopied image, tries to understand what is in it, and assigns meaning to its different segments is *`parsing`* that image.

In Python terms, merely *importing* or *loading* a text file into a data structure such as a pandas Series or DataFrame is *`Reading`*.\
Specifying how this transfer should be done, for instance by choosing the index of the structure, is *`Parsing`*.\
Hence, *reading* and *parsing* are two separate activities.
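
The two steps can be sketched with the standard library alone. The text below stands in for the content of a file that has already been read, and the column names and values are invented for the example.

```python
import csv
import io

# Reading: the content is simply held in memory as one string.
# (Hypothetical content, as it might look right after being read from a file.)
raw_text = "date,price\n2024-01-31,10.5\n2024-02-29,11.0\n"

# Parsing: give the text structure (named fields) and meaning (types),
# here by using the 'date' column as the key, much like an index.
rows = list(csv.DictReader(io.StringIO(raw_text)))
prices_by_date = {row["date"]: float(row["price"]) for row in rows}

print(prices_by_date)
```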

## Types of Data

Before you start cleaning or preprocessing information, you need to be acquainted with the structure of the data. When you are given some data, you want to be able to obtain specific pieces of information from it. The better the structure in which the values have been stored, the easier it will be to find what you're looking for.\
According to its organisation, data can be categorized as:
- **Structured**
- **Unstructured**
- **Semi-structured**

### Structured Data
- When you first hear the term 'data', the image that comes to mind is usually `Tables` containing numeric or text values. `Rows` and `Columns` are two very common indicators that help you find the information you are looking for quickly and efficiently. Data organised this way is called `Structured Data`, or sometimes 'Traditional Data'.
- It is arranged in tabular form, can be stored in databases, and can be managed from one computer.
- Its clear structure makes it easy to access a specific data value.
- In practice, such data can be stored in an Excel spreadsheet, an SQL database, or a pandas DataFrame.

### Unstructured Data
- Unstructured data refers to information that is organised in a way that makes finding a specific piece of it hard.
- Examples include video, audio, photos, presentations, webpages, text, and more.

### Semi-structured Data
- Structured data is based on relational databases. In contrast, `semi-structured data` does not follow a rigid tabular layout; it relies on other patterns for storing and organizing the data that still allow for easy access and analysis.
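
JSON is a common example of a semi-structured format. In the sketch below the records are invented for illustration; note that the two records do not share the same fields, which a strict table would not allow without empty cells.

```python
import json

# Hypothetical records: both have a name, but the other fields differ.
raw = """
[
  {"name": "Ana", "email": "ana@example.com", "tags": ["admin", "staff"]},
  {"name": "Ben", "phone": "555-0100"}
]
"""

records = json.loads(raw)

# Collect every field that appears in at least one record.
all_keys = sorted({key for record in records for key in record})

# Missing fields have to be handled explicitly, here with dict.get().
emails = [record.get("email") for record in records]

print(all_keys)
print(emails)
```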

### Big Data
- Big Data refers to extremely large, complex datasets that are distributed across multiple computers.
- They are characterized as 'Big' because of their Velocity, Volume, Variety, Veracity, Value, etc. (*also called the Vs of Big Data*).
- The categorization of data as structured, unstructured, or semi-structured applies to Big Data too.

## Data Connectivity through Text
### File Formats:
- On a technical level, different pieces of software, or software products, communicate through text files of one form or another.
- The most widely used file formats for this today are `.json`, `.xml`, `.csv`, `.txt`, and `.xlsx`.
- Since almost all software today uses text files to communicate with other software, it is fundamental to know what a text file looks like and how it works.
- The usual file formats when working with Python will be:
  - `.csv`, which is perfect for storing tabular data.
  - `.json`, which can be read, exported, and worked with by almost all modern programming languages.
  - and sometimes `.xlsx`, if your team relies on Excel for analysis, since it facilitates the team's workflow.
- The relevant Python techniques for working with these types of files are similar, but differ to an extent.
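
To give a feel for how the techniques are similar yet differ, the sketch below writes the same made-up records as CSV and as JSON using only the standard library, then reads both back. In-memory strings are used instead of real files to keep the example self-contained.

```python
import csv
import io
import json

# Hypothetical records used for both formats.
records = [
    {"month": "Jan", "price": 10.5},
    {"month": "Feb", "price": 11.0},
]

# CSV: tabular text, one row per line, values separated by commas.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["month", "price"], lineterminator="\n")
writer.writeheader()
writer.writerows(records)
csv_text = csv_buffer.getvalue()

# JSON: nested text that keeps the difference between numbers and strings.
json_text = json.dumps(records)

# Reading back: CSV values return as strings, JSON values keep their types.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
json_rows = json.loads(json_text)

print(csv_text)
print(csv_rows[0]["price"], json_rows[0]["price"])
```

One practical difference this shows: after a CSV round trip every value is a string and must be converted by hand, whereas JSON restores numbers as numbers.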

### Types of Text Files
**Plain Text File**
- Any text file that contains information with no formatting (text only) is referred to as plain text.
- Created by text editors such as Notepad.

**Rich Text File**
- Text files that contain some formatting.
- Created by word processors, such as MS Word.
- Rich text files are also called 'documents', which is why they have file extensions such as `.doc` or `.docx`.

When working in Python, we handle data from plain text files, as rich text files are not ideal for this purpose.

## Organisation of Information in Text File

To keep certain types of potential errors under control while cleaning or preprocessing data, you need to be aware of the following terms.

- A dataset is a collection of values. If all the values were thrown into a single text file, glued to each other, neither software nor people would be able to tell them apart, making the file unreadable. Therefore, you need specific characters to indicate where a value starts and ends, as well as where a line ends.
- **Character**: any mark or symbol you can use in writing. It can be a letter, a digit, a punctuation mark, or another symbol. Characters normally make up strings of letters, numbers, or marks.
- Such characters are not universal, however. The `Character encoding specification` determines which characters are available, including those that can be used to indicate the end of a value, the end of a line, and so on.
- Generally, people work with `text`. Machines, however, only work with `bytes`. That is, machines understand commands written in *machine language*, the lowest-level language, consisting only of `1s` and `0s` organized in a specific order.
- Thus, `Encoding` is the process of converting, or translating, text information into bytes.
- **Character encoding specifications** define how computers represent text characters as numbers, with the most common modern standard being **UTF-8** (*8-bit Unicode Transformation Format*). Other specifications include **UTF-16**, which is used in Windows and Java, and **ASCII**, an older, 7-bit standard for English characters.
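
The effect of the encoding specification can be seen directly with `str.encode()` and `bytes.decode()`. The word used below is just a sample chosen because it contains one non-English character.

```python
# "caf\u00e9" is the word 'café'; the escape avoids editor encoding issues.
text = "caf\u00e9"

# The same four characters become different byte sequences per encoding.
utf8_bytes = text.encode("utf-8")       # 'é' takes two bytes in UTF-8
utf16_bytes = text.encode("utf-16-le")  # every character here takes two bytes

# ASCII only covers 7-bit English characters, so 'é' cannot be encoded.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Decoding with the matching encoding restores the text...
restored = utf8_bytes.decode("utf-8")
# ...while decoding with the wrong one silently produces garbled text.
garbled = utf8_bytes.decode("latin-1")

print(utf8_bytes, len(utf8_bytes), len(utf16_bytes))
print(ascii_ok, restored, garbled)
```

This is why it is usually worth passing `encoding=` explicitly when opening text files rather than relying on the platform default.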

Choosing the right encoding is crucial for correctly displaying content across different platforms and languages. After choosing the suitable encoding specification, you need to keep in mind the following terms:
- **Separator**: a character that separates the values in your file. This can be a comma (`,`), a semicolon (`;`), a tab (`\t`), a space (` `), or something else.
- **End-of-line** character (EOL): also called newline character, line ending, or line break. It is written as `\n` or as the combination `\r\n`, and it informs the interpreter that a new line begins right after it.
- **End-of-file** (EOF): a signal, or condition, that there are no more characters in the text file.
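
The three terms can be seen at work in a few lines of Python. The content below is invented, and `io.StringIO` is used as a stand-in for a real text file.

```python
import io

# Hypothetical content: ';' as separator and '\r\n' as end-of-line.
raw = "a;b;c\r\n1;2;3\r\n"

# splitlines() recognises both '\n' and '\r\n' line endings.
lines = raw.splitlines()

# The separator tells us where each value starts and ends.
table = [line.split(";") for line in lines]

# End-of-file: readline() returns an empty string once nothing is left.
stream = io.StringIO("only line\n")
first = stream.readline()
at_eof = stream.readline()

print(table)
print(repr(first), repr(at_eof))
```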

## More on Text Files

### Plain Text Files vs Flat Files
- Both files contain non-formatted text.
- In `Plain text files`, each data value is stored on a separate line, with no separators. For example, a plain text file containing a company's average monthly stock price stores each monthly value on its own line.
- A `Flat file` resembles a plain text file with the difference that the data has some type of separator to separate the values. The separators help form a structure that corresponds to a single data table, i.e. a structure corresponding to a data stored in a tabular form.
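
The contrast can be sketched with two small strings standing in for file contents. The prices and month names are made up for the example.

```python
# Plain text file: one value per line, no separators.
plain_text = "10.5\n11.0\n9.8\n"
monthly_prices = [float(line) for line in plain_text.splitlines()]

# Flat file: each line holds several values divided by a separator,
# so the content maps onto a single table with rows and columns.
flat_text = "month,price\nJan,10.5\nFeb,11.0\nMar,9.8\n"
header, *rows = [line.split(",") for line in flat_text.splitlines()]

print(monthly_prices)
print(header, rows)
```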

### .csv Files
The most common example of a flat file is the CSV file. Like every flat file, it uses a *separator* to separate the values. This can be a comma, a tab, or a semicolon.

**Comma-separated values**
Typically, CSV stands for 'comma-separated values', which implies that the separator used in the flat file is a comma.\
In practice, the term is used so loosely that it also covers values separated by other characters, which has given rise to another expansion of 'CSV'...

**Character-separated value**
In these flat files, instead of a comma, we can have a semi-colon or a tab.
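
Python's standard `csv` module handles such files through its `delimiter` parameter. The sketch below uses invented data with a semicolon as the separator; one value deliberately contains a semicolon inside quotes to show why a plain `str.split(";")` would not be enough.

```python
import csv
import io

# Hypothetical semicolon-separated content; the quoted value contains ';'.
raw = 'name;comment\nAna;"likes a; b"\n'

# csv.reader respects the quotes, so the inner ';' is not treated as a separator.
rows = list(csv.reader(io.StringIO(raw), delimiter=";"))

# A naive split breaks the quoted value into two pieces.
naive = raw.splitlines()[1].split(";")

print(rows)
print(naive)
```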

### .dsv Files
DSV stands for 'delimiter-separated values'. A 'delimiter' and a 'separator' are two different things. Delimiters are not merely intended to separate values; their actual function is to define distinct fields, like cells in a row or rows in a table. In this way, any delimiter is a separator, but a separator may also be just a space between two words, so a separator is not always a delimiter.

### .tsv Files
TSV (tab-separated values) files are also a type of DSV file for storing tabular data, where the values in a row are separated by tabs.
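
A tab-separated round trip might look like the sketch below, again with the standard `csv` module and made-up values. The tab character is written as `\t` in Python.

```python
import csv
import io

# Write two hypothetical rows with a tab as the delimiter.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows([["item", "quantity"], ["pencil", 100]])
tsv_text = buffer.getvalue()

# Read the same text back, telling the reader which delimiter to expect.
rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))

print(repr(tsv_text))
print(rows)
```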

In the end, it is really important to be aware of how your values have been separated or delimited, so that you can properly transfer your dataset from a text file into another program such as Python.

## Relational database vs Flat File database
- A **relational database** typically contains multiple data tables that relate to each other in a certain way.
- A **flat file database** corresponds to a single table from a relational database. Hence, it can be seen as a relational database composed of a single data table.
- A relational database builds upon the functionality of flat files and the potential relationships between them to provide more flexibility and consistency of the data.
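
The relationship between the two can be sketched with Python's built-in `sqlite3` module. The table names, columns, and values are all hypothetical: two related tables are created in an in-memory database and then joined into a single flat table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary in-memory database

# Two tables that relate to each other through customer_id.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 20.0), (2, 1, 5.5), (3, 2, 7.0)],
)

# Joining the tables yields one flat table, as it could be exported to a CSV file.
flat_rows = conn.execute(
    "SELECT customers.name, orders.amount "
    "FROM orders JOIN customers ON customers.id = orders.customer_id "
    "ORDER BY orders.id"
).fetchall()
conn.close()

print(flat_rows)
```

In the flat result the customer name is repeated on every order row, whereas the relational layout stores it only once.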

## Fixed-width Files
- These are Flat files where data is arranged in columns of a pre-determined, fixed number of characters.
- Unlike delimited files, they don't use separators; instead, data is identified by its exact position and length, with shorter entries padded with spaces to fill their allotted width.
- It is the padding, rather than a delimiter, that fixes the fields: every value in a given column occupies the same number of characters.
- Such files are characterized by:
  - Fixed column widths
  - Pad characters to fill in the empty spaces, and
  - Left/Right alignments
- These files are rarely used nowadays, as they often tend to be problematic at the various stages of data preprocessing.
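
A minimal sketch of writing and parsing such a file follows. The layout is an assumption made for the example: a 10-character left-aligned name, a 5-character right-aligned quantity, and an 8-character right-aligned price, all padded with spaces.

```python
# Hypothetical records and column layout.
records = [("Widget", 3, 4.5), ("Gadget", 12, 19.99)]

# Writing: format specs pad each value to its fixed width and alignment.
lines = [f"{name:<10}{qty:>5}{price:>8.2f}" for name, qty, price in records]

# Parsing: values are found by position, so each field is a slice of the line.
field_slices = {
    "name": slice(0, 10),
    "qty": slice(10, 15),
    "price": slice(15, 23),
}
parsed = [
    {field: line[s].strip() for field, s in field_slices.items()}
    for line in lines
]

for line in lines:
    print(repr(line))
print(parsed)
```

Because the positions carry all the structure, a value that is one character too long shifts everything after it, which is one reason these files can be troublesome during preprocessing.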

## Common Naming Conventions