## Production 1: Data translation for storage

For a program to function each time it runs, it needs an external, persistent data storage system that retains the state of the data. There are two considerations here: the physical storage medium for the data, such as a file or database, and the format/structure of the data. This week you have explored a number of different formats in both regards, as follows:

- CSV
- XML
- JSON
- MongoDB
- SQL Database

Select ONE format that you consider most suited to the data in the scenario and the aims of the program (the client brief or your own data). The format selected should support both the nature of the data and the aims of the application being designed. It should provide distinct advantages and minimal limitations compared with the other data formats. It should not be selected solely because it is the easiest to program, although this can be included as an advantage if applicable.

### Design

Produce a model that shows how the data needs to be restructured to take best advantage of the selected format and work more effectively within the program. Where you have created groups or objects from the data show how they relate to each other.
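
As a minimal sketch of what such a restructuring might look like, the example below groups flat rows into one parent object per key, each holding a list of child records. The field names (`station_id`, `station_name`, `date`, `reading`) and the helper `group_rows` are hypothetical and only illustrate the idea; your own model will depend on the data in your scenario.

```python
# Hypothetical flat rows, shaped like the output of a simple CSV parser.
flat_rows = [
    {"station_id": "S1", "station_name": "North", "date": "2023-01-01", "reading": "4.2"},
    {"station_id": "S1", "station_name": "North", "date": "2023-01-02", "reading": "3.9"},
    {"station_id": "S2", "station_name": "South", "date": "2023-01-01", "reading": "5.1"},
]


def group_rows(rows, key_field, parent_fields):
    """Group flat rows into one parent object per key with a list of child records."""
    grouped = {}
    for row in rows:
        key = row[key_field]
        if key not in grouped:
            # First time this key is seen: copy the parent-level fields once.
            grouped[key] = {field: row[field] for field in parent_fields}
            grouped[key]["records"] = []
        # Everything that is not a parent field belongs to the child record.
        child = {k: v for k, v in row.items() if k not in parent_fields}
        grouped[key]["records"].append(child)
    return list(grouped.values())


stations = group_rows(flat_rows, "station_id", ["station_id", "station_name"])
```

Grouping like this removes the repeated parent values from every row, which is one way a nested format such as JSON can represent the relationship between objects.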

### Implementation

Implement a parser that reads in the original data file. You may want to create a subset of the data file for testing and speed. Your program should then translate the data from the original format/structure into your selected format. The result of this process should then be written to the relevant physical medium (files/database).
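
One possible way to create a test subset is sketched below. It assumes the original file is line-based text with a single header line; the function name `write_subset` and its parameters are illustrative, not part of the brief.

```python
from itertools import islice


def write_subset(src_path, dst_path, max_rows=100):
    """Copy the header line plus the first max_rows data lines to a new file.

    Returns the number of data lines copied (the header is not counted).
    """
    with open(src_path, "r", encoding="utf-8") as src:
        # +1 so that the header line is included alongside the data lines.
        lines = list(islice(src, max_rows + 1))
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(lines)
    return max(len(lines) - 1, 0)
```

Because `islice` stops reading after the requested number of lines, the full file is never loaded into memory.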

At this stage there is no requirement to handle data types (other than those inherent in the data format, i.e. numbers and “Strings”), conversions or missing data. The program can be demonstrated as a simple console-based application that asks the user for the file name and produces sufficient output to demonstrate the correctness of the translation process.

Your program should print regular output statements to the console so that it is easy to follow what the program is doing and to provide a visual demonstration of the translation process. This will also be handy for any debugging required.
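
A minimal sketch of periodic progress output during translation is shown below. The function `translate_rows` and the `report_every` parameter are assumptions made for illustration; the interval should be chosen to suit the size of your file.

```python
def translate_rows(headers, rows, report_every=1000):
    """Turn each row (a list of values) into a dict, printing progress as it goes."""
    records = []
    for count, row in enumerate(rows, start=1):
        records.append(dict(zip(headers, row)))
        # Report at a fixed interval so large files do not flood the console.
        if count % report_every == 0:
            print(f"Translated {count} rows...")
    print(f"Finished: {len(records)} rows translated")
    return records
```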

### Reflection on design decisions

Write a 200-word reflection that states the reason for your format selection, the advantages the format lends to the data and the application, and any limitations on the future use of this data within the selected format.

In [1]:
import json, os

In [2]:
SRC_DIR = 'datasets/'
OUT_DIR = 'outputs/'

In [43]:
# CSV parser


class CSV:
    def __init__(self, csv_file):
        self.__file = csv_file
        self.__headers = []
        self.__data = []
        self.__parse()

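    def __readln(self, line):
        # Naive split: does not handle quoted fields that contain commas
        return line.strip().split(",")

    def __parse(self):
        with open(self.__file, "r") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines so they do not become empty rows
                if not self.__headers:
                    # The first non-blank line holds the column names
                    self.__headers = self.__readln(line)
                else:
                    self.__data.append(self.__readln(line))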

    @property
    def headers(self):
        return self.__headers

    @property
    def data(self):
        return self.__data

    def to_dict(self):
        # zip pairs each header with its value and tolerates short rows
        return [dict(zip(self.__headers, row)) for row in self.__data]

    def to_json(self):
        return json.dumps(self.to_dict(), indent=4)

In [44]:
# main functions


def parse_csv():
    csv_filename = input("Enter the csv filename: ")
    csv_filepath = SRC_DIR + csv_filename
    if not os.path.exists(csv_filepath):
        print(f"File {csv_filepath} not found")
        return None
    print(f"Parsing CSV file {csv_filename}...")
    csv_parser = CSV(csv_filepath)
    print(f"CSV file {csv_filename} parsed successfully")
    return csv_parser


def export_json(csv_parser):
    json_filename = input("Enter the json filename: ")
    json_filepath = OUT_DIR + json_filename
    print(f"Exporting JSON file {json_filename}...")
    os.makedirs(os.path.dirname(json_filepath), exist_ok=True)
    with open(json_filepath, "w") as f:
        f.write(csv_parser.to_json())
    print(f"JSON file {json_filename} exported successfully")


def main():
    csv_parser = parse_csv()
    if csv_parser is not None:
        print()
        export_json(csv_parser)
    else:
        print("Exited due to error")


# Run the console application
main()

## Production 3: Data Cleaning and Initial Analysis

The client brief sets out a number of requirements to provide access to specific parts of the data and to answer specific statistical questions. For this production, focus on how your application will manipulate the data (cleaning and shaping) and on developing functions that calculate some of the statistical requirements.

### Design

Consider the steps required for cleaning and shaping and any of the calculations (functions) you want to develop. Write pseudocode to sketch these out before you write code. Mentally, or on paper, walk the data through your pseudocode steps to test how effective your solution is.

### Implementation

The first stage is to clean the data and make sure it is fit for purpose. Examine the data carefully to identify anomalies, then consider how your program can detect these and correct, delete or change them. You will need to consider how you are going to handle erroneous or missing values. You should output a sample of the data that demonstrates how cleaning has changed the data.
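
A minimal cleaning sketch with pandas is shown below. The column names (`site`, `power`) and the rules applied (unparseable or negative values become missing, rows without a `site` are dropped) are assumptions for illustration only; the right rules depend on the anomalies you find in your own data.

```python
import pandas as pd

# Hypothetical raw data containing a missing identifier and invalid readings.
raw = pd.DataFrame({
    "site": ["A", "B", None, "D"],
    "power": ["10.5", "n/a", "7.0", "-1"],
})


def clean(df):
    """Return a cleaned copy of df, leaving the original untouched."""
    cleaned = df.copy()
    # Convert text to numbers; values that cannot be parsed become NaN.
    cleaned["power"] = pd.to_numeric(cleaned["power"], errors="coerce")
    # Assumption: a negative reading is erroneous, so mark it as missing.
    cleaned.loc[cleaned["power"] < 0, "power"] = float("nan")
    # Rows without an identifier cannot be used, so drop them.
    cleaned = cleaned.dropna(subset=["site"])
    return cleaned.reset_index(drop=True)


cleaned = clean(raw)
print("Before cleaning:")
print(raw.head())
print("After cleaning:")
print(cleaned.head())
```

Printing the same sample before and after makes it easy to show in a screenshot what the cleaning step changed.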

The next stage is to reshape the data as per any requirements of the brief (or your own scenario). Is all the data needed to provide the required results? Is any of it duplicated? Is there data across different sources that needs to be brought together? Again, output a sample to the console to demonstrate how this has changed the structure of the data.
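
The questions above could be sketched in pandas as follows: keep only the columns that are needed, drop duplicated rows, and bring in data from a second source. Both data frames and all column names here are hypothetical.

```python
import pandas as pd

# Hypothetical readings with an unneeded column and one duplicated row.
readings = pd.DataFrame({
    "site_id": [1, 1, 2, 2],
    "power": [10.5, 10.5, 7.0, 8.0],
    "notes": ["", "", "", ""],
})
# Hypothetical second source describing each site.
sites = pd.DataFrame({"site_id": [1, 2], "region": ["North", "South"]})

shaped = (
    readings[["site_id", "power"]]            # keep only the columns required
    .drop_duplicates()                        # remove exact duplicate rows
    .merge(sites, on="site_id", how="left")   # add region from the second source
)

print("Before reshaping:")
print(readings.head())
print("After reshaping:")
print(shaped.head())
```

A left merge keeps every reading even if a site has no matching entry in the second source; an inner merge would silently drop such rows.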

Finally, develop and test a set of functions (or objects and methods) that apply the statistical analysis to the data set, outputting the results to the console.
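
As one possible shape for such a function, the sketch below returns a few summary statistics for a column, optionally per group. The function name `summarise`, the chosen statistics and the sample data are assumptions; the statistics you need are defined by the brief.

```python
import pandas as pd


def summarise(df, value_col, group_col=None):
    """Return mean, median, min and max of value_col, optionally per group."""
    stats = ["mean", "median", "min", "max"]
    if group_col is None:
        return df[value_col].agg(stats)
    return df.groupby(group_col)[value_col].agg(stats)


# Hypothetical data used only to demonstrate the function.
data = pd.DataFrame({"region": ["N", "N", "S"], "power": [10.0, 20.0, 6.0]})
print("Overall statistics for power:")
print(summarise(data, "power"))
print("Statistics for power by region:")
print(summarise(data, "power", "region"))
```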

Capture the results of your data cleaning, shaping and functions with screenshots of the console. There is no requirement at this stage for anything to be functioning through the GUI. Make sure it is clear what your output is testing/demonstrating (output simple, informative statements).

### Reflection on design decisions

Write a 200-word reflection that states why you have selected specific tools from NumPy or pandas. This may cover the data structure you have used or the functions you have applied to clean and shape the data. Clearly identify which specific aspects of the requirements or the data’s structure informed your decisions.

In [6]:
import pandas as pd

In [36]:
class DF:
    def __init__(self, csv_file):
        self.__df = pd.read_csv(csv_file)

    @property
    def columns(self):
        return self.__df.columns

    @property
    def data(self):
        return self.__df

    def remove(self, column: str, value: object) -> None:
        """Remove rows where column is equal to value"""
        self.__df = self.__df[self.__df[column] != value]

    def rename(self, column: str, new_name: str) -> None:
        """Rename a column"""
        # Reassign rather than mutate in place, consistent with remove() and merge()
        self.__df = self.__df.rename(columns={column: new_name})

    def merge(self, df: "DF", on: str, how: str = 'inner') -> None:
        """Merge two dataframes"""
        self.__df = self.__df.merge(df.data, on=on, how=how)